Statistics (Part 1) – What would you say you do here?

This is the first post in a series of posts I will be doing about SQL Server statistics.

As I got more serious about learning SQL Server one of the things I had a hard time wrapping my mind around was statistics. In the abstract I could understand what they were for but beyond that I didn’t understand how they worked. Now that I’ve learned a lot about them I figure others could use some simple instructions on what they are. Hopefully this is simple enough.

Stats are the unsung hero of query optimization. With good stats you can get a well optimized query. With bad stats…not so good. I’ve seen 100x performance gains from implementing good statistics.

Statistics help the query optimizer determine the best execution plan to use to fulfill a query by predicting the number of rows that will be returned from operations. Knowing the number of rows that will be returned by different operations the optimizer can choose the combinations of operators that will best (lowest cost – quickest, less I/O, less CPU) fulfill the query.

The best way to explain this is with an example. I’m going to create a table with three columns, an IDENTITY column, a char(1) column and a varchar(10) column. There will be 10,000 rows. In 9999 of them the character column will be the value “A” while 1 of them is the value “B”. The varchar(10) column is some throw away data. I will put an index on the character column, Col1. The index will automatically create statistics. Then I can run a couple of queries against Col1 to see what types of query plans are generated.

CREATE TABLE dbo.StatTest
(ID int IDENTITY(1,1) NOT NULL,Col1 char(1),Col2 varchar(10))

CREATE CLUSTERED INDEX IC_StatTest ON dbo.StatTest(ID)

CREATE INDEX IX_StatTest ON dbo.StatTest(Col1)
GO

INSERT INTO dbo.StatTest (Col1,Col2) VALUES ('A','ABCDEFGHIJ')
GO 9999

INSERT INTO dbo.StatTest (Col1,Col2) VALUES ('B','ABCDEFGHIJ')
GO

The first test will be this query with the predicate searching for records where Col1 matches “A”.

SELECT * FROM dbo.StatTest WHERE Col1 = 'A'

This is the query plan produced. It is a clustered index scan.

Looking at the properties of the clustered index scan operator it shows and estimate of 9999 rows to be returned.

However with this execution plan of this query using a different predicate results in a different query plan.

SELECT * FROM dbo.StatTest WHERE Col1 = 'B'

This plan uses an index seek/bookmark lookup combination. The optimizer estimated that 1 row would be returned for this value. In the previous query an index scan was the optimal way to retrieve all of the rows where Col1 = A. In the second query and index seek and book mark lookup are the optimal.

It was the statistics that enabled the optimizer to craft a query plan that was optimal for each data condition. Without the statistics the optimizer wouldn’t have had any idea how many rows would be returned or what the best method of accessing the data was.

The consequences of using the wrong plan are fairly noticeable. Imagine if the query optimizer had no way of knowing there was only one value of ‘B’ in Col1 and chose to do a table scan instead of a bookmark lookup.

By using SET STATISTICS IO ON we can see how may logical reads were required to fulfill this query when using a bookmark lookup.

Table ‘StatTest’. Scan count 1, logical reads 4, physical reads 3, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Now forcing it to use a table scan instead.

Table ‘StatTest’. Scan count 1, logical reads 40, physical reads 1, read-ahead reads 40, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

The table scan required 41 physical reads (physical reads and read-ahead reads) while the bookmark lookup required only 3. The number of physical reads to perform the scan is indefinite and related to the size of the table. The number of physical reads to perform the bookmark lookup may increase slightly as the size of the index grows (and the B-tree grows deeper) but nothing like the number a table scan will need. Why use a table scan to find a single record when a key lookup is faster? Likewise why use key lookups to lookup up multiple records when a table scan is faster? It is statistics that helps the optmizer to determine the best method to access the data.

In the next post I look at the statistics object and start to examine how the optimizer uses it to get cardinality estimates.

2 thoughts on “Statistics (Part 1) – What would you say you do here?

Comments are closed.