Statistics (Part 7) – Magic Density

Sometimes the optimizer has absolutely no idea how to estimate cardinality. Either there are no statistics and auto-create statistics has been disabled, or the predicate isn’t in a form that lets SQL Server use any of the existing statistics. There was an example of the latter problem in my last blog post about stats on computed columns. On these occasions the optimizer must be like Spock.

SPOCK
In order to return us to the exact moment at which we left the 23rd Century, I have used our journey back through time as a referent, calculating the coefficient of elapsed time in relation to the acceleration curve.

BONES
Naturally.

SPOCK
Acceleration is no longer a constant.

BONES
Well, you’re gonna have to take your best shot.

SPOCK
Best shot?

BONES

SPOCK
Guessing is not in my nature.

BONES
Well nobody’s perfect.

Unlike Mr. Spock, guessing is definitely in the nature of the optimizer, and when it has no other way to estimate cardinality it will simply guess. Well, it is always guessing, but most of the time the guess is educated, informed by stats. Without information to go by, the guesses are so unbelievably bad you’d think you’d stumbled into a screening of Highlander 2: The Quickening. There are some rules it has to follow, but ultimately it is taking completely uneducated guesses. To my knowledge there is no flag or other notification that it has taken a complete WAG, so knowing how it is going to guess has been helpful in discovering that the optimizer is guessing at all. And knowing that it is guessing helps track down why we have crappy query plans, and points us in a direction where we might be able to fix them.

For years I’ve read that if the optimizer doesn’t have statistics to go on it will estimate 10% of the rows in the table for straight equality searches.
Consider this query.

SELECT * FROM dbo.StatTest WHERE Col1 = 'A'

On a table with 12,000 rows this would be estimated at 1,200 rows returned. In preparation for this series of blog posts I tested everything I was asserting before putting it on the page. The only time I could come up with a cardinality estimate of 10% of the row count was with 10,000 rows in the table. I also could not figure out what percentage was being used; if I changed the table size, the percentage changed. At that point I suspected that the cardinality estimate was logarithmic in nature and changed based on the size of the base table.

So I created many test tables, each with a different number of rows: 1000, 2000, 3000, 4000, 5000, 10000, 11000, etc. I plotted the numbers of rows estimated on a graph and, with the help of someone who knows what he is doing, was able to figure out the formula that fit the curve.

It turns out that instead of 10% the estimate is ROWS ^ .75. This explained why my 10,000 row table gave me a 10% estimate (10000^.75 = 1000 = 10000 * .10). It was pure coincidence.
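If you want to check the curve without building a pile of test tables, the formula is easy to play with in a few lines of Python (the function name is mine; the exponent is just what the curve fitting produced):

```python
# No-stats guess for a single-column equality predicate: ROWS ^ .75
def magic_estimate_one_column(table_rows):
    return table_rows ** 0.75

# 10,000 rows is the one size where the guess happens to equal 10%
# of the table: 10000^.75 = 1000 = 10000 * .10
print(magic_estimate_one_column(10000))   # 1000.0
print(magic_estimate_one_column(12000))   # ~1146.5, not the 1200 that 10% predicts
```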

I then tested a query whose predicate queried against two columns.

SELECT * FROM dbo.StatTest WHERE Col1 = 'A' AND Col2 = 'B'

I expected the same formula to apply, but it didn’t. Plotting the numbers on a graph showed that the formula is ROWS ^ .6875.

I was about to give up when I ran across this blog post from Ian Jose, which shows the factor constants used for cardinality estimates with different numbers of equality matches in the predicate. I’m repeating the values here in case that page ever goes away.

1 column – Rows ^ (3/4) = Rows ^ .75
2 column – Rows ^ (11/16) = Rows ^ .6875
3 column – Rows ^ (43/64)
4 column – Rows ^ (171/256)
5 column – Rows ^ (170/256)

175 column – Rows ^ (0/256)
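Here’s a quick Python sketch of those exponents for the column counts I actually verified, one through four (the helper name is mine):

```python
# Exponent applied to the table's row count for N equality predicates
# when the optimizer has nothing to go on (Ian Jose's constants):
EXPONENTS = {1: 3/4, 2: 11/16, 3: 43/64, 4: 171/256}

def magic_estimate(table_rows, n_columns):
    return table_rows ** EXPONENTS[n_columns]

# More columns in the predicate -> a smaller exponent -> a lower estimate
for n in range(1, 5):
    print(n, round(magic_estimate(100000, n)))
```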

I tested this out to four-column equality comparisons, decided that was good enough, and took Ian’s numbers on faith for the rest.
For other situations the magic densities I’ve read about were all correct: BETWEEN uses 9% of the row count, and <, >, <=, >= use 30% of the row count.
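Those magic densities are simple enough to express directly (again, the function names are mine):

```python
# No-stats guesses for other predicate shapes:
# BETWEEN is estimated at 9% of the table, and the inequality
# operators <, >, <=, >= at 30%.
def magic_estimate_between(table_rows):
    return table_rows * 0.09

def magic_estimate_inequality(table_rows):
    return table_rows * 0.30

print(magic_estimate_between(12000))     # 1080.0
print(magic_estimate_inequality(12000))  # 3600.0
```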

Ultimately, if the optimizer is using magic densities it is a good idea to figure that out and then create the stats, or fix the code, so that it can make educated guesses. Your instance will love you for it.

Statistics (Part 6) – Computed Columns

This topic is unbelievably dry. Drier than ancient Egyptian pharaoh bone dust dry. Drier than the surface of the sun dry. The layman might say “you’re talking about SQL Server and you expect it to be interesting?” Fair point. Maybe the topic of SQL Server is dry to the layman, but almost every topic is interesting to someone. Some people collect thimbles or bugs. Some people juggle geese. And as interested in SQL Server as I am, this particular aspect of stats has got to be the driest I’ve ever written about. Death Valley dry. Drier than Aunt Edna’s Thanksgiving turkey dry.

However I feel it is necessary to include this information as part of the series. Please pardon me if I try to spice it up a bit. I already shorted out one keyboard with boredom-induced drool. This builds on my previous, and hopefully less boring, posts on stats.

So…stats on computed columns. You can do this. Even if the column isn’t persisted. There is a caveat though. The formula for the column must be deterministic.

What does it mean to be deterministic? Wikipedia defines determinism as “a metaphysical philosophy stating that for everything that happens there are conditions such that, given those conditions, nothing else could happen.” In computer science this means that given the same input you will always get the same output. In a deterministic universe, if we went back 14 billion years and reran the big bang, I’d still be typing this boring topic and you would still have stopped reading two paragraphs ago (see what I did there?).

It would be impossible to generate statistics on a column that is created from a non-deterministic expression. Imagine a column based on the result of RAND(). Every select would produce a new value on each row, and the stats would be wrong the minute they were created. So SQL Server won’t let you create stats on non-deterministic computed columns.

However given the predictability of determinism (this is actually the third time you’ve read this in the last 42 billion years you just don’t remember the first two times) SQL Server is easily able to create statistics on a computed column, even if the computed column isn’t persisted. If the computed column C is the sum of column A and column B then the computed column will always be A + B even if C is never saved on disk.
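The same idea, sketched in Python rather than T-SQL (the function names are mine): a deterministic expression can be recomputed at any time and always agree with itself, which is exactly what makes stats on an unpersisted computed column possible.

```python
import random

# Deterministic: same inputs, same output, every time. Statistics
# built over a + b stay valid even if the value is never stored.
def computed_c(a, b):
    return a + b

# Non-deterministic: a new value on every evaluation, so any
# statistics built on it would be stale the moment they existed.
def computed_rand():
    return random.random()

print(computed_c(2, 3))   # 5, today and in 14 billion years
print(computed_rand())    # different nearly every call
```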

According to the documentation the optimizer can make estimates even if the predicate contains the formula the computed column is calculated from instead of the column name. I will test that too. Fun.

Statistics (Part 5) – String Index

I’ve been talking about different aspects of stats and cardinality estimates. My last post is hiding here.

The String Index portion of a statistics object is sparsely documented. The existence of the string index in a stats object can be discovered by looking at the header output of DBCC SHOW_STATISTICS.

The documentation from Microsoft often refers to this as the String Summary.

The string index assists in cardinality estimates for substrings in predicates containing the LIKE clause with wildcard characters. It only does this for the leading column of a stats object and only for character data types, i.e. the string index is constructed when the leading column of a stats object is a character column.

The index contains substrings from the first and last 40 characters of a string. If the string is 80 characters long then the string index is created for all characters. If the string is 120 characters long the middle 40 characters would not be included in the string index.
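A little Python makes the rule concrete (this is my own sketch of the observed behavior, not anything documented):

```python
# The string index only sees the first 40 and last 40 characters
# of each value; anything confined to the middle is invisible to it.
def substring_covered(value, sub):
    if len(value) <= 80:
        return sub in value               # whole string is covered
    return sub in value[:40] or sub in value[-40:]

row_first  = "A" * 10 + "B" * 10 + "A" * 100   # B-run inside the first 40
row_middle = "A" * 50 + "D" * 10 + "A" * 60    # D-run inside the middle 40

print(substring_covered(row_first, "B" * 10))   # True
print(substring_covered(row_middle, "D" * 10))  # False
```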

To show this I created a test table with a column that is 120 characters long. This column can logically be divided into three sections: the first 40 characters, the middle 40 characters, and the last 40 characters. In each of these sections I create a substring that is unique to that section and does not exist in the other sections. By querying the table with LIKE and wildcards I can see if the cardinality estimates for the middle section are different from the estimates for the beginning and ending sections. In theory the cardinality estimate for the middle section should be the same as for a predicate whose substring is not part of the string index. I also include several thousand rows that have no notable substrings at all, just to give the table some mass.

CREATE TABLE StringTest (ID int IDENTITY(1,1) PRIMARY KEY NOT NULL, Col1 varchar(120))
INSERT INTO StringTest (Col1) SELECT REPLICATE('A',120)
GO 1000
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',10) +
REPLICATE('B',10) +
REPLICATE('A',100)
GO 30
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',10) +
REPLICATE('C',10) +
REPLICATE('A',100)
GO 30
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',120)
GO 1000
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',50) +
REPLICATE('D',10) +
REPLICATE('A',60)
GO 30
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',50) +
REPLICATE('E',10) +
REPLICATE('A',60)
GO 30
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',120)
GO 1000
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',90) +
REPLICATE('F',10) +
REPLICATE('A',20)
GO 30
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',90) +
REPLICATE('G',10) +
REPLICATE('A',20)
GO 30
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',120)
GO 1000

CREATE STATISTICS StringTestStat ON dbo.StringTest(Col1) WITH FULLSCAN
DBCC SHOW_STATISTICS('StringTest','StringTestStat')

In the first section there are 30 rows with the substring ‘BBBBBBBBBB’ and 30 with ‘CCCCCCCCCC’; in the middle section there are 30 rows with the substring ‘DDDDDDDDDD’ and 30 with ‘EEEEEEEEEE’; and in the last section there are 30 rows with the substring ‘FFFFFFFFFF’ and 30 with ‘GGGGGGGGGG’ (giggle). What we should see are cardinality estimates for the B and C strings similar to the estimates for the F and G strings, while the estimates for the D and E strings should be similar to the estimate for a substring that doesn’t exist.

Substrings in the first 40 characters:

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%BBBBBBBBBB%'

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%CCCCCCCCCC%'

Substring in the middle 40 characters:

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%DDDDDDDDDD%'

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%EEEEEEEEEE%'

Substrings in the last 40 characters:

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%FFFFFFFFFF%'

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%GGGGGGGGGG%'

Substring that doesn’t exist:

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%SSSSSSSSSS%'

As predicted, the estimates for the substrings in the first section (the B and C substrings) are similar to the estimates for the substrings in the last section (the F and G substrings). However, the estimates for the middle section (the D and E substrings) are similar to the estimates for substrings that do not exist in the column.

Given that there is so little documentation on this topic this is the best I can do to show how the string index can influence cardinality estimates.

My next post will be about statistics on computed columns.

Statistics (Part 4) – Multi-column cardinality

In the previous posts the predicates for the example queries have all had only one element, i.e. only one constraint against the table in the WHERE clause. While troubleshooting cardinality estimate problems it is helpful to understand what happens when there are two constraints against two different columns in the same table.

Update 2/22/2014:
Before reading this you might want to check out Paul White’s article on this topic where he gives a much better explanation than I did. He also includes the case of OR selectivity which didn’t even occur to me to investigate.

This is the sample query that will be used to show how cardinality estimates happen when there are two constraints.

SELECT * FROM dbo.AWStatTest where AddressLine1 = N'100, rue des Rosiers' and PostalCode = N'2010'

In Post 2 of this series I created a table called AWStatTest and filled it with data for demo purposes. So you don’t have to refer back to that post here is the DDL for that table and the index I created.

CREATE TABLE AWStatTest
(
ID int identity(1,1),
AddressLine1 nvarchar(60), -- used by the sample queries; nvarchar(60) matches the AdventureWorks source column
City nvarchar(30),
PostalCode nvarchar(15)
)

-- The examples depend on an index named IX_AWStatTest whose leading column
-- is AddressLine1; the exact key list here is a best guess.
CREATE INDEX IX_AWStatTest ON dbo.AWStatTest (AddressLine1, City, PostalCode)

Creating the index also creates a statistics object. This is a picture of the statistics for IX_AWStatTest, which will be referenced in this post.

The predicate in the sample query is constraining the search by AddressLine1 and PostalCode. As shown in my post on density, the statistics IX_AWStatTest alone cannot be used for cardinality estimates for queries constrained by AddressLine1 and PostalCode.

There are at least two ways this can be estimated, 1) by using multiple stats or 2) by using a multi-column statistic

Using Multiple Stats
One of the features of SQL Server is the ability to create single-column statistics on the fly in order to help out with optimization (if the database is set to auto-create them, anyway…my database is set up to do so). When I ran the query above, SQL Server created statistics for the column PostalCode. The name of the statistics object is machine generated: _WA_Sys_00000005_5AEE82B9 (it will be named differently on your machine). You can see these stats with DBCC SHOW_STATISTICS.

DBCC SHOW_STATISTICS ('dbo.AWStatTest','_WA_Sys_00000005_5AEE82B9')

This statistics object has a density vector and histogram as well.

SQL Server can use the histogram for AddressLine1 from the statistics object IX_AWStatTest and the histogram for PostalCode from _WA_Sys_00000005_5AEE82B9 to make a cardinality guess. The histogram for PostalCode shows that our constraint value ‘2010’ is a RANGE_HI_KEY, so EQ_ROWS will be used for the cardinality estimate for this value: 13714.8 rows in the table are estimated to match postal code 2010. Likewise, the histogram for AddressLine1 shows that ‘100, rue des Rosiers’ is a RANGE_HI_KEY, and EQ_ROWS would give us an estimate of 2020 rows if the predicate included a search for just this value.

So how are these used together? A quick example that should be familiar to anyone who knows playing cards helps explain how two different histograms can be combined into one cardinality estimate. There are thirteen cards in a suit and four suits per deck of standard playing cards. What is the probability of drawing a face card that is a heart? In this case we are looking for cards that have two specific characteristics: a face card and a heart. There is a 13/52 chance of drawing a heart. There are 12 face cards (Jack, Queen, King) in a deck, so there is a 12/52 chance of drawing a face card.

The probability of drawing a face card that is also a heart is:
13/52 * 12/52 = .0576923.
If we were to treat this like a density and estimate the number of cards in the deck that match our double criteria, we would multiply .0576923 by the number of cards (rows) in the deck: .0576923 * 52 = 3. This is exactly the number of matching cards the deck actually contains.
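In Python the card arithmetic looks like this (just the math from above, nothing SQL Server specific):

```python
deck = 52
p_heart = 13 / deck    # selectivity of "is a heart"
p_face = 12 / deck     # selectivity of "is a face card"

# Multiply the selectivities, then scale back up by the "table" size
estimate = p_heart * p_face * deck
print(round(estimate, 4))   # 3.0
```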

Let’s translate this example to our cardinality problem. We are looking for rows of data that have two specific characteristics: AddressLine1 and PostalCode equal to specific values. Here the probabilities of a match on AddressLine1 and PostalCode are 2020/4040008 and 13714.8/4040008 respectively.

Multiply these together along with the number of rows in the table, and the cardinality estimate for the query is: (2020/4040008) * (13714.8/4040008) * 4040008 = 6.85739.

The same approach extends to three constraints:

SELECT * FROM dbo.AWStatTest where AddressLine1 = N'11, quai de l´ Iton' and PostalCode = N'94010' and City = N'Bellingham'

The individual cardinality estimates for these three values are 9090, 42062.04, and 58405.71 respectively.

(9090/4040008) * (42062.04/4040008) * (58405.71/4040008) * 4040008 = 1.36819
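Both calculations can be wrapped in one small Python helper (the function name is mine):

```python
total_rows = 4040008

def combined_estimate(histogram_rows, total):
    # Multiply each column's selectivity together, then scale by table size
    selectivity = 1.0
    for rows in histogram_rows:
        selectivity *= rows / total
    return selectivity * total

print(combined_estimate([2020, 13714.8], total_rows))            # ~6.857
print(combined_estimate([9090, 42062.04, 58405.71], total_rows)) # ~1.368
```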

The examples above all used values that were RANGE_HI_KEYS from the respective histograms. In my tests I have seen that estimates using values from the AVG_RANGE_ROWS column work exactly the same way.

Using multiple statistics objects isn’t the only way SQL Server can estimate cardinality on a table with multiple predicates.

Using a multi-column density
In Post 2 I talked about density vectors and how they can be used. They can also be used for multi-column estimates. To force SQL Server to use a multi-column density for cardinality estimation I am going to create a new table with the same data as the old one, turn off auto creation of statistics and create a statistics object for just PostalCode and AddressLine1. The primary reason I’m creating a new table is because I’ve done so much testing on my other one that I fear being unable to get pure estimates.

SELECT * INTO dbo.AWAL1PCTest from dbo.AWStatTest
ALTER DATABASE TestDB SET AUTO_CREATE_STATISTICS OFF
-- Create the multi-column statistics object (the name here is illustrative)
CREATE STATISTICS ST_AWAL1PC ON dbo.AWAL1PCTest (AddressLine1, PostalCode) WITH FULLSCAN

I should end up with a statistics object for PostalCode and AddressLine1 that looks like this:

The density value for the combination of AddressLine1 and PostalCode is .000249066
The estimate should be .000249066 * 4040008 = 1006.23
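The density-vector route is a single multiplication, easy to confirm in Python:

```python
# All-columns density for (AddressLine1, PostalCode), read from the
# density vector, times the number of rows in the table:
all_density = 0.000249066
total_rows = 4040008

print(round(all_density * total_rows, 2))   # 1006.23
```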

Running this query:

SELECT * FROM dbo.AWAL1PCTest where AddressLine1 = N'100, rue des Rosiers' and PostalCode = N'2010'

The cardinality estimate is 1006.23, matching the density calculation.

In this post we have seen that cardinality estimates for predicates that contain multiple matching conditions for two different columns on the same table can be estimated in at least two ways.
1. By using the histogram values from multiple statistics objects
2. By using a multi-column density value

The point of the previous three posts is not to outline all of the ways the density vector and histogram parts of statistics objects are used to make cardinality estimates. That information is not publicly available (that I can find, anyway). Most of what I talk about in the previous three posts comes from performing a lot of tests and comparing results to calculations. The purpose is to highlight how the various components can be involved so that one can have more success in troubleshooting cardinality estimate issues. I’ve often been surprised at some of the estimates that come up; they don’t always align with the “rules” I’ve laid out above, though they usually do. That being said, tracing cardinality estimates back to the specific statistics used has been helpful in identifying skewed data distributions and possible places for optimization.

The next topic is an area that is very opaque and, I suspect, is one of the causes of some of the odd cardinality estimates I’ve seen: the String Index.