Statistics (Part 2) – I am your density

In order to show how statistics work I first need to create some data to work with. This table, along with the one created in the first post of this series, will be used in this example.

CREATE TABLE AWStatTest
(
ID int identity(1,1),
AddressLine1 nvarchar(60),
AddressLine2 nvarchar(60),
City nvarchar(30),
PostalCode nvarchar(15)
)

Using the following command, varying the TOP clause and the number of executions specified on the GO command, I populated the table with data to test against.

INSERT INTO AWStatTest (AddressLine1,AddressLine2, City,PostalCode)
SELECT TOP 1000 AddressLine1,AddressLine2,City,PostalCode
FROM AdventureWorks.Person.Address
GO 1000

The content of the table looks something like this. It has 4040008 rows.

I then created this index on it (which also creates statistics of the same name) and looked at the stats using DBCC SHOW_STATISTICS.

CREATE INDEX IX_AWStatTest ON dbo.AWStatTest (AddressLine1,AddressLine2,City,PostalCode)
GO
DBCC SHOW_STATISTICS('dbo.AWStatTest','IX_AWStatTest')
GO

The properties for the statistics object IX_AWStatTest look like this.

The information displayed by DBCC SHOW_STATISTICS is divided into three sections: header, density vector and histogram. There is useful information in the header, but the density vector and the histogram are the most interesting parts when it comes to understanding how cardinality estimates are done.
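If you only care about one of those sections, DBCC SHOW_STATISTICS also accepts an option to limit the output. For example, to return just the density vector:

-- Return only the density vector section of the statistics
DBCC SHOW_STATISTICS('dbo.AWStatTest','IX_AWStatTest') WITH DENSITY_VECTOR
GO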

In order to understand the density vector you have to understand what density is. Density is a measurement of how selective an index is. It can be calculated by the formula: 1/(Number of Unique Values in the set). In our case the set is the column (or columns) upon which the statistics are built.

If ‘A’ is in the column 10 times, ‘B’ is in the column 20 times and ‘C’ is in the column 1000 times the number of unique values is 3, ‘A’, ‘B’ and ‘C’.

To get the number of unique values you can use this query:

SELECT COUNT(*)
FROM (SELECT DISTINCT addressline1 FROM dbo.AWStatTest) AA

In our case there are 2815 unique values for AddressLine1. 1/2815 = 0.0003552398. Now take a gander at the All Density value for the column AddressLine1 in the 2nd section of the statistics example. It is .0003552398. This is how the density value in the second section is calculated.
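If you want to verify that number without reaching for a calculator, the same density can be computed directly with a query (this is just my own sanity check, not something SQL Server exposes):

-- 1/(number of distinct values) for AddressLine1; this should come out to
-- roughly 0.0003552398, matching the All density value from DBCC SHOW_STATISTICS
SELECT 1.0 / COUNT(DISTINCT AddressLine1) AS [Density AddressLine1]
FROM dbo.AWStatTest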

Take the density and multiply it by the number of rows in the table: .0003552398 * 4040008 = 1435.1715. This means that if there were an even distribution of each of the unique values there would be roughly 1435 rows for each value. We can check this by multiplying 1435.1715 by 2815 (the number of unique values), which comes out to 4040007.7725.
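The same arithmetic can be done in T-SQL (again, just a verification of my own):

-- density * row count = average number of rows per distinct value (about 1435.17 here)
SELECT (1.0 / COUNT(DISTINCT AddressLine1)) * COUNT(*) AS [Avg rows per value]
FROM dbo.AWStatTest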

Density can also be calculated with the formula (Average Number of Duplicates)/(Number of Rows).

In this example the average number of duplicates is (number of rows)/(number of distinct values): 4040008/2815 = 1435.1715 (look familiar?). Divide this by the number of rows and you get the same density value as the first formula. We can actually derive the first formula from the second: ((rows/distinct values)/rows) cancels the row count and leaves 1/(distinct values).

Knowing that the two formulas reduce to the same thing I will use 1/2815 for simplicity.

This is for a single column. How about the density on multiple columns? When we look at a combination of columns the density is based on the uniqueness of that combination.

SELECT COUNT(*) [ALL DENSITY]
FROM
(
SELECT DISTINCT AddressLine1,AddressLine2 FROM dbo.AWStatTest
) AA

This query tells us there are 2833 unique values for the combination of AddressLine1 and AddressLine2. 1/2833 = 0.0003529827, which matches the All density value in the second section for AddressLine1 and AddressLine2.

So what can these density values do for us? They can help us estimate how many times a specific value (or combination of values) exists in a set of data by multiplying the density by the number of rows in the set (table).

This example shows the optimizer using the Density value from the density vector. I’m specifically using the variable here in the predicate to cause SQL Server to use the density vector. I will explain in a future post what makes this happen.

DECLARE @address nvarchar(50)
SET @address = 'test'
SELECT * FROM dbo.AWStatTest WHERE AddressLine1 = @address

A snapshot of the estimated number of rows to be returned.

1435.17. Does that number look familiar? It is the density of AddressLine1 .0003552398, multiplied by the number of rows in the table, 4040008.

If we selected on AddressLine1 and AddressLine2 the density used would have been 0.0003529827.
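A sketch of what that would look like, using local variables for both columns (the variable names are mine; the estimate should come out to roughly 0.0003529827 * 4040008, or about 1426 rows):

DECLARE @address1 nvarchar(60), @address2 nvarchar(60)
SET @address1 = 'test'
SET @address2 = 'test'
-- With both columns compared to variables, the optimizer can fall back on the
-- two-column density: 0.0003529827 * 4040008 = roughly 1426 estimated rows
SELECT * FROM dbo.AWStatTest WHERE AddressLine1 = @address1 AND AddressLine2 = @address2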

When SQL Server creates statistics on multiple columns it creates density values for combinations of those columns, in the order the columns are specified in the CREATE STATISTICS command (or in our case the CREATE INDEX command, because statistics are created as a by-product of index creation). For the statistics associated with the IX_AWStatTest index the columns are AddressLine1, AddressLine2, City and PostalCode. The four density values computed are:
AddressLine1
AddressLine1, AddressLine2
AddressLine1, AddressLine2, City
AddressLine1, AddressLine2, City, PostalCode

Recall that density is a function of the number of unique values, and if the density is computed on multiple columns it is based on the number of unique values of that combination of columns.

In these statistics the number of unique values is computed for AddressLine1, for the combination of AddressLine1 and AddressLine2, for the combination of AddressLine1, AddressLine2, and City, and finally for the combination of AddressLine1, AddressLine2, City, and PostalCode.
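You can compute all four of these densities yourself and compare them against the All density column from DBCC SHOW_STATISTICS (a verification query of my own, not how SQL Server stores them):

-- 1/(distinct values) for each column prefix in the statistics
SELECT
 1.0 / (SELECT COUNT(*) FROM (SELECT DISTINCT AddressLine1 FROM dbo.AWStatTest) A) AS [Density 1 col],
 1.0 / (SELECT COUNT(*) FROM (SELECT DISTINCT AddressLine1, AddressLine2 FROM dbo.AWStatTest) A) AS [Density 2 cols],
 1.0 / (SELECT COUNT(*) FROM (SELECT DISTINCT AddressLine1, AddressLine2, City FROM dbo.AWStatTest) A) AS [Density 3 cols],
 1.0 / (SELECT COUNT(*) FROM (SELECT DISTINCT AddressLine1, AddressLine2, City, PostalCode FROM dbo.AWStatTest) A) AS [Density 4 cols]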

The important thing to note is that just because two columns are in the same statistics object doesn't mean the density can be used. In this example the columns AddressLine1 and PostalCode are both represented in the fourth density value, but that line is not the combination of AddressLine1 and PostalCode only. That combination doesn't exist in these statistics. If you need that combination you would have to create another statistics object.
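For example, something like this would give you a density value covering just AddressLine1 and PostalCode (the statistics name here is made up for illustration):

-- Creates a statistics object whose density vector covers
-- AddressLine1 and (AddressLine1, PostalCode)
CREATE STATISTICS ST_AWStatTest_Addr1_Postal
ON dbo.AWStatTest (AddressLine1, PostalCode)
GO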

Density is a great tool for calculating cardinality estimates and is used often by the query optimizer, but there is a problem. This is a good graphical representation of what the problem is.

Each of these columns has the same density even though the different values (red and green) are only distributed evenly (i.e. the same number of each, not the fact that they are ordered red to green) in one of the columns. Estimates calculated based on density can be grossly inaccurate if the data distribution is skewed. In this case the skew isn't significant because there are only ten rows. How about a case where there are 100,000 rows and only one of them is red? The estimate would be 50,000, wildly over for one value and wildly under for the other.
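As a rough sketch of that skewed case (the table name and values are mine, purely for illustration):

-- 100,000 rows with only 2 distinct values: density = 1/2 = 0.5,
-- so a density-based estimate is 0.5 * 100000 = 50,000 rows for either value,
-- even though 'Red' appears once and 'Green' appears 99,999 times
CREATE TABLE dbo.SkewDemo (Color varchar(10))
INSERT INTO dbo.SkewDemo (Color) VALUES ('Red')
INSERT INTO dbo.SkewDemo (Color) SELECT TOP (99999) 'Green' FROM dbo.AWStatTest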

To get much, much better estimates SQL Server uses histograms which will be the topic of the next post.
