Statistics (Part 3) – Histograms

As noted in the previous entry density can be used to predict the number of records that will be returned from a query against a table. This is fine if there is an even distribution of data but what about the cases where there is not.

As demonstrated in the previous post the density for the column AddressLine1 of 0.0003552398 multiplied by the number of rows in the table, 4040008, gives us 1435.17, which is the number of rows the query optimizer estimates will be returned from a query against the table with a single predicate against the column AddressLine1. Given an even distribution of data each unique value in the column would exist in the set roughly 1435 times. SQL Server can use the density vector to estimate the number or rows that will be returned from a query. The caveat here is ‘given and even distribution’. What if the distribution is uneven?

What if you have 10000 rows with 9999 values in a column equal to A and 1 equal to B? This was the example I set up in my first post in this series. In this case the density would be 1/Number of Unique Values or ½ or .5. If we use the technique of multiplying the density by the number of rows in the table we would get an estimate of 5000 rows for each value. This could cause the optimizer to produce a plan that scans the entire table as opposed to the plan that queries an index and does a book mark lookup. If the query is looking for records where the value of AddressLIne1 is A plan produced is OK since it is likely a table scan will be used anyway. But looking for values of B would require a table scan to find a single record. On the flip side using bookmark lookups to retreive 9999 rows when a table scan would require less I/O is not a good idea either. I refer you to my first post in the series where I showed the number of physical reads required from a table scan relative to a bookmark lookup.

SQL Server can use histograms to make more refined predictions.

In my previous post I created a test table with 4040008 rows and an index on the column AddressLine1. The index automatically creates a statisitcs objects. You can see the histogram of the statistics attached to the demo index by using DBCC SHOW_STATISTICS. I use the WITH HISTOGRAM option so that only the histogram is returned.

DBCC SHOW_STATISTICS('AWSTatTest','IX_AWStatTest')
WITH HISTOGRAM

One thing to note about the histogram is that the values of RANGE_HI_KEY are sorted in ascending order. All values not equal to a RANGE_HI_KEY naturally will be of greater or lesser value than the RANGE_HI_KEYs and, subsequently, can be thought of as existing between two keys that are side by side. If our range hi keys are 5, 10, 20, and 25 the value 7 exists between the entries of 5 and 10. While it also exists between the entries of 5 and 20 the next highest entry to 7 is 10. For query optimization the data in the row of the next highest entry is used to for cardinality estimates. In our example the value of ‘081, boulevard du Montparnasse’ exists between ‘#500-75 O’Connor Street’ and ‘1, place Beaubernard’ when these values are sorted in ascending order.

With this we know then there are two types of estimates that can be made from the histogram. Values equal to the RANGE_HI_KEY and values that are between two RANGE_HI_KEYs which are side by side.

The cardinality estimates for predicates equal to the RANGE_HI_KEY come from the EQ_ROWS column. If we query the table for the value, “1, place Beaubernard” in the column AddressLine1, the optimizer will predict 4040 rows to be returned.

SELECT * FROM dbo.AWStatTest where AddressLine1 = '1, place Beaubernard'

With a full sample the EQ_ROWS value for each RANGE_HI_KEYS show the exact number of times that each of the RANGE_HI_KEYS value is in the index/heap’s column. The values of in the RANGE_ROWS and DISTINCT_RANGE_ROWS are used to calculate AVG_RANGE_ROWS which, in turn, is used to provide cardinality estimates for values that don’t match the RANGE_HI_KEYs.

RANGE_ROWS is the number or records which contain a value that sorts between two side by side range rows, the current row and the next lowest row. In our example range rows for ‘1, place Beaubernard’ is 14147. This means there are 14147 records which contain a value between ‘1, place Beaubernard’ and ‘#500-75 O’Connor Street’.

We can use this query to test this.

SELECT COUNT(*) FROM dbo.AWStatTest
WHERE AddressLine1 > '#500-75 O''Connor Street'
AND AddressLine1 < '1, place Beaubernard'

DISTINCT_RANGE_ROWS are the number of distinct values between two range hi keys.

SELECT COUNT(*) [count]
FROM
(
SELECT DISTINCT AddressLine1
FROM dbo.AWStatTest
WHERE AddressLine1 < '1, place Beaubernard'
AND AddressLine1 > '#500-75 O''Connor Street'
) AA

If you recall from the previous lesson density is calculated as 1/Unique number of values. So we can calculate the density of the range between two RANGE_HI_KEYS. The density of the range between ‘1, place Beaubernard’ and ‘#500-75 O’Connor Street’ is 1/13 or .0769231.

Density times the number of rows gives us the average number of times each value will exist in this range. .0769231 X 14147 = 1088.231 which is the value of the AVG_RANGE_ROWS column for the RANGE_HI_KEY ‘1, place Beaubernard’. If we query AddressLine1 for a value that exists in this range the number of rows estimated to be returned should be 1088.231.

SELECT * FROM dbo.AWStatTest where AddressLine1 = '081, boulevard du Montparnasse'

This density estimation between the RANGE_HI_KEYs works just like the density estimation for the entire column but with more precision. It can be more precise because the range is smaller and because as the histogram is built SQL Server can choose range hi keys so that the steps between them have a more even distribution.

These are examples of single columns statistics being used with queries against individual columns. What happens if the query against the table contains two predicates. That is the focus of the next post.

One thought on “Statistics (Part 3) – Histograms

Comments are closed.