# Statistics (Part 4) – Multi-column cardinality

In the previous posts the predicates for the example queries have all had only one element, i.e. only one constraint against the table in the WHERE clause. While troubleshooting cardinality estimate problems it is helpful to understand what happens when there are two constraints against two different columns in the same table.

Update 2/22/2014:
Before reading this you might want to check out Paul White’s article on this topic where he gives a much better explanation than I did. He also includes the case of OR selectivity which didn’t even occur to me to investigate.

This is the sample query that will be used to show how cardinality estimates happen when there are two constraints.

SELECT * FROM dbo.AWStatTest where AddressLine1 = N'100, rue des Rosiers' and PostalCode = N'2010'

In Post 2 of this series I created a table called AWStatTest and filled it with data for demo purposes. So you don’t have to refer back to that post here is the DDL for that table and the index I created.

CREATE TABLE AWStatTest
(
ID int identity(1,1),
City nvarchar(30),
PostalCode nvarchar(15)
)

The index creation also includes a creation of statistics. This is a picture of the statistics for IX_AWStatTest and will be referenced in this post.

The predicate in the sample query is constraining the search by AddressLine1 and PostalCode. As shown in my post on density, the statistics IX_AWStatTest alone cannot be used for cardinality estimates for queries constrained by AddressLine1 and PostalCode.

There are at least two ways this can be estimated, 1) by using multiple stats or 2) by using a multi-column statistic

Using Multiple Stats
One of the features of SQL Server is the ability to create single column statistics on the fly in order to help out with optimization (if the database is set to auto create them anyway…my database is set up to do so). When I run the query above SQL Server created statistics for the column PostalCode. The name of those statistics is machine generated, _WA_Sys_00000005_5AEE82B9 (this will be named differently on your machine). You can see these stats with DBCC SHOW_STATISTICS.

DBCC SHOW_STATISTICS ('dbo.AWStatTest','_WA_Sys_00000005_5AEE82B9')

This statistics object has a density vector and histogram as well.

SQL Server can use the histogram for AddressLine1 from the statistic IX_AWStatTest and the histogram for PostalCode from the statistic _WA_Sys_00000005_5AEE82B9 to make a cardinality guess. The histogram for PostalCode shows that our constraint value ‘2010’ is a RANGE_HI_KEY and therefore EQ_ROWS will be used to provide cardinality estimates for this value. The number or rows in the table estimated to match postal code 2010 is 13714.8. Looking at the histogram for AddressLine1 shows that ‘100, rue des Rosiers’ is a RANGE_HI_KEY and EQ_ROWS would give us an estimate of 2020 rows if the predicate included a search for just this value.

So how are these used together? A quick example problem that should be easy for anyone familiar with playing cards to understand which helps us understand using two different histograms to make a cardinality estimate. There are thirteen cards in a suit and four suits per deck of standard playing cards. What is the probablity of drawing a face card that is a heart? In this case we are looking for cards that have two specific characteristics, a face card and a heart. There is a 13/52 chance of drawing a heart. There are 12 face cards (Jack, Queen, King) in a deck, so there is a 12/52 chance of drawing a face card.

The probability of drawing a face card that is also a heart is:
13/52 * 12/52 = .0576923.
If we were to treat this like a density and estimate the number of cards in the deck that would match our double criteria we would multiply .0576923 by the number or cards (rows) in the deck, .0576923 * 52 = 3. This is the exact number or cards matching the criteria we actually have.

Let’s translate this example to our cardinality problem. We are looking for rows of data that have two specific characteristics, AddressLine1 and PostalCode equal to specific values. In this situation the probability of a match is for AddressLine1 and Postal Code are 2020/4040008 and 13714.8/4040008 respectively.

Multiply these together by the number of rows in the table and the cardinality estimation is for the query is: (2020/4040008) * (13714.8/4040008) * 4040008 = 6.85739.

SELECT * FROM dbo.AWStatTest where AddressLine1 = N'11, quai de l´ Iton' and PostalCode = N'94010' and City = N'Bellingham'

The cardinality estimates for each of these values individual are, 9090, 42062.04, and 58405.71 respectively.

(9090/4040008) * (42062.04/4040008) * (58405.71/4040008) * 4040008 = 1.36819

The examples above all used values that were RANGE_HI_KEYS from the respective histograms. In my tests I have seen that estimates using values from the AVG_RANGE_ROWS column work exactly the same way.

Using multiple statistics objects isn’t the only way SQL Server can estimate cardinality on a table with multiple predicates.

Using a multi-column density
In Post 2 I talked about density vectors and how they can be used. They can also be used for multi-column estimates. To force SQL Server to use a multi-column density for cardinality estimation I am going to create a new table with the same data as the old one, turn off auto creation of statistics and create a statistics object for just PostalCode and AddressLine1. The primary reason I’m creating a new table is because I’ve done so much testing on my other one that I fear being unable to get pure estimates.

SELECT * INTO dbo.AWAL1PCTest from dbo.AWStatTest
ALTER DATABASE TestDB SET AUTO_CREATE_STATISTICS OFF

I should end up with a statistics object for PostalCode and AddressLine1 that looks like this:

The density value for the combination of AddressLine1 and PostalCode is .000249066
The estimate should be .000249066 * 4040008 = 1006.23

Running this query:

SELECT * FROM dbo.AWStatTest where AddressLine1 = N'100, rue des Rosiers' and PostalCode = N'2010'

The cardinality estimates is

In this post we have seen that cardinality estimates for predicates that contain multiple matching conditions for two different columns on the same table can be estimated in at least two ways.
1. By using the histogram values from multiple statistics objects
2. By using a multi-column density value

The point of the previous three posts is not to outline all of the ways the density vector and histogram parts of statistics objects are used make cardinality estimates. That information is not publically available (that I can find anyway). Most of what I talk about in the previous three posts comes from performing a lot of tests and comparing results to calculations. The purpose is to highlight how the various components can be involved so that one can have more success in troubleshooting cardinality estimate issues. I’ve often been surprised at some of the estimates that come up and they don’t always align themselves to the “rules” I’ve laid out above, though they usually do. That being said, tracing cardinality estimates back to specific statistics used has been helpful in identifying skewed data distributions and possible places for optimizations.

The next topic is an area that is very opaque and, I suspect, is one of the causes of some of the odd cardinality estimates I’ve seen: the String Index.

# Statistics (Part 1) – What would you say you do here?

This is the first post in a series of posts I will be doing about SQL Server statistics.

As I got more serious about learning SQL Server one of the things I had a hard time wrapping my mind around was statistics. In the abstract I could understand what they were for but beyond that I didn’t understand how they worked. Now that I’ve learned a lot about them I figure others could use some simple instructions on what they are. Hopefully this is simple enough.

Stats are the unsung hero of query optimization. With good stats you can get a well optimized query. With bad stats…not so good. I’ve seen 100x performance gains from implementing good statistics.

Statistics help the query optimizer determine the best execution plan to use to fulfill a query by predicting the number of rows that will be returned from operations. Knowing the number of rows that will be returned by different operations the optimizer can choose the combinations of operators that will best (lowest cost – quickest, less I/O, less CPU) fulfill the query.

The best way to explain this is with an example. I’m going to create a table with three columns, an IDENTITY column, a char(1) column and a varchar(10) column. There will be 10,000 rows. In 9999 of them the character column will be the value “A” while 1 of them is the value “B”. The varchar(10) column is some throw away data. I will put an index on the character column, Col1. The index will automatically create statistics. Then I can run a couple of queries against Col1 to see what types of query plans are generated.

CREATE TABLE dbo.StatTest
(ID int IDENTITY(1,1) NOT NULL,Col1 char(1),Col2 varchar(10))

CREATE CLUSTERED INDEX IC_StatTest ON dbo.StatTest(ID)

CREATE INDEX IX_StatTest ON dbo.StatTest(Col1)
GO

INSERT INTO dbo.StatTest (Col1,Col2) VALUES ('A','ABCDEFGHIJ')
GO 9999

INSERT INTO dbo.StatTest (Col1,Col2) VALUES ('B','ABCDEFGHIJ')
GO

The first test will be this query with the predicate searching for records where Col1 matches “A”.

SELECT * FROM dbo.StatTest WHERE Col1 = 'A'

This is the query plan produced. It is a clustered index scan.

Looking at the properties of the clustered index scan operator it shows and estimate of 9999 rows to be returned.

However with this execution plan of this query using a different predicate results in a different query plan.

SELECT * FROM dbo.StatTest WHERE Col1 = 'B'

This plan uses an index seek/bookmark lookup combination. The optimizer estimated that 1 row would be returned for this value. In the previous query an index scan was the optimal way to retrieve all of the rows where Col1 = A. In the second query and index seek and book mark lookup are the optimal.

It was the statistics that enabled the optimizer to craft a query plan that was optimal for each data condition. Without the statistics the optimizer wouldn’t have had any idea how many rows would be returned or what the best method of accessing the data was.

The consequences of using the wrong plan are fairly noticeable. Imagine if the query optimizer had no way of knowing there was only one value of ‘B’ in Col1 and chose to do a table scan instead of a bookmark lookup.

By using SET STATISTICS IO ON we can see how may logical reads were required to fulfill this query when using a bookmark lookup.