Sometimes the optimizer has absolutely no idea how to estimate cardinality. There are either no statistics and auto-create statistics has been disabled or the predicate isn’t of a form that SQL Server can use any of the existing statistics. There was an example of the latter problem in my last blog post about stats on computed columns. On these occasions the optimizer must be like Spock.
In order to return us to the exact moment at which we left the 23rd Century, I have used our journey back through time as a referent, calculating the coefficient of elapsed time in relation to the acceleration curve.
So what is your problem?
Acceleration is no longer a constant.
Well, you’re gonna have to take your best shot.
Guess, Spock. Your best guess.
Guessing is not in my nature.
Well nobody’s perfect.
Unlike Mr. Spock guessing is definitely in the nature of the optimizer and when it has no other way to estimate cardinality it will simply guess. Well, it is always guessing but most of the time the guess is educated, informed by stats. Without information to go by the guesses are so unbelievably bad you’d think you’d stumbled into a screening of Highlander 2: The Quickening. There are some rules it has to go by but ultimately it is taking completely uneducated guesses. However I’ve found that knowing how it is going to guess has been helpful in discovering that the optimizer is guessing. To my knowledge there is no flag or other notification that it has taken a complete WAG. And knowing that it is guessing helps track down why we have crappy query plans and points us in a direction where we might be able to fix them.
For years I’ve read that if the optimizer doesn’t have statistics to go on it will estimate 10% of the rows in the table for straight equality searches.
Consider this query.
On a table with 12,000 rows this would be estimated at 1,200 rows returned. In preparation for this series of blog posts I tested everything I was asserting before putting it on the page. The only time I could come up with a 10% of row count cardinality estimate was with 10,000 rows on the table. I also could not figure out what percentage was being used. If I changed the table size the percentage changed. At that point I suspected that the cardinality estimate was logarithmic in nature and changed based on the size of the base table.
So I created many test tables with each with a different number of rows, 1000, 2000, 3000, 4000, 5000, 10000, 11000, etc. I plotting the numbers of rows estimated on a graph and, with the help of someone who knows what he is doing, was able to figure out the formula that fit the curve.
It turns out that the instead of 10% the estimate is ROWS ^ .75. This explained why my 10000 row table gave me a 10% estimate (10000^.75 = 1000 = 10000 * .10.) It was pure coincidence.
I then tested a query whose predicate queried against two columns.
I expected to use the same formula but it didn’t work. Plotting the numbers on a graph showed that the formula is ROWS ^ .6875.
I was about to give up when I ran across this blog post from Ian Jose which showed the factor constant used for cardinality estimates with a different number of equality matches in the predicate. I’m repeating the values here in case that page ever goes away.
1 column – Rows ^ (3/4) = Rows ^ .75
2 column – Rows ^ (11/16) = Rows ^ .6875
3 column – Rows ^ (43/64)
4 column – Rows ^ (171/256)
5 column – Rows ^ (170/256)
175 column – Rows ^ (0/256)
I tested this out to four column equality comparisons and decided that was good enough and I would believe Ian’s numbers.
For other situations the magic densities I’ve read about all were correct. BETWEEN uses 9% of row count and <, >, <=, =>, use 30% of row count.
Ultimately if the optimizer is using magic densities it is a good idea to figure that out and create the stats or fix the code that allow it to make educated guesses. Your instance will love you for it.