Statistics (Part 3) – Histograms

As noted in the previous entry density can be used to predict the number of records that will be returned from a query against a table. This is fine if there is an even distribution of data but what about the cases where there is not.

As demonstrated in the previous post the density for the column AddressLine1 of 0.0003552398 multiplied by the number of rows in the table, 4040008, gives us 1435.17, which is the number of rows the query optimizer estimates will be returned from a query against the table with a single predicate against the column AddressLine1. Given an even distribution of data each unique value in the column would exist in the set roughly 1435 times. SQL Server can use the density vector to estimate the number or rows that will be returned from a query. The caveat here is ‘given and even distribution’. What if the distribution is uneven?

What if you have 10000 rows with 9999 values in a column equal to A and 1 equal to B? This was the example I set up in my first post in this series. In this case the density would be 1/Number of Unique Values or ½ or .5. If we use the technique of multiplying the density by the number of rows in the table we would get an estimate of 5000 rows for each value. This could cause the optimizer to produce a plan that scans the entire table as opposed to the plan that queries an index and does a book mark lookup. If the query is looking for records where the value of AddressLIne1 is A plan produced is OK since it is likely a table scan will be used anyway. But looking for values of B would require a table scan to find a single record. On the flip side using bookmark lookups to retreive 9999 rows when a table scan would require less I/O is not a good idea either. I refer you to my first post in the series where I showed the number of physical reads required from a table scan relative to a bookmark lookup.

SQL Server can use histograms to make more refined predictions.

In my previous post I created a test table with 4040008 rows and an index on the column AddressLine1. The index automatically creates a statisitcs objects. You can see the histogram of the statistics attached to the demo index by using DBCC SHOW_STATISTICS. I use the WITH HISTOGRAM option so that only the histogram is returned.

DBCC SHOW_STATISTICS('AWSTatTest','IX_AWStatTest')
WITH HISTOGRAM

One thing to note about the histogram is that the values of RANGE_HI_KEY are sorted in ascending order. All values not equal to a RANGE_HI_KEY naturally will be of greater or lesser value than the RANGE_HI_KEYs and, subsequently, can be thought of as existing between two keys that are side by side. If our range hi keys are 5, 10, 20, and 25 the value 7 exists between the entries of 5 and 10. While it also exists between the entries of 5 and 20 the next highest entry to 7 is 10. For query optimization the data in the row of the next highest entry is used to for cardinality estimates. In our example the value of ‘081, boulevard du Montparnasse’ exists between ‘#500-75 O’Connor Street’ and ‘1, place Beaubernard’ when these values are sorted in ascending order.

With this we know then there are two types of estimates that can be made from the histogram. Values equal to the RANGE_HI_KEY and values that are between two RANGE_HI_KEYs which are side by side.

The cardinality estimates for predicates equal to the RANGE_HI_KEY come from the EQ_ROWS column. If we query the table for the value, “1, place Beaubernard” in the column AddressLine1, the optimizer will predict 4040 rows to be returned.

SELECT * FROM dbo.AWStatTest where AddressLine1 = '1, place Beaubernard'

With a full sample the EQ_ROWS value for each RANGE_HI_KEYS show the exact number of times that each of the RANGE_HI_KEYS value is in the index/heap’s column. The values of in the RANGE_ROWS and DISTINCT_RANGE_ROWS are used to calculate AVG_RANGE_ROWS which, in turn, is used to provide cardinality estimates for values that don’t match the RANGE_HI_KEYs.

RANGE_ROWS is the number or records which contain a value that sorts between two side by side range rows, the current row and the next lowest row. In our example range rows for ‘1, place Beaubernard’ is 14147. This means there are 14147 records which contain a value between ‘1, place Beaubernard’ and ‘#500-75 O’Connor Street’.

We can use this query to test this.

SELECT COUNT(*) FROM dbo.AWStatTest
WHERE AddressLine1 > '#500-75 O''Connor Street'
AND AddressLine1 < '1, place Beaubernard'

DISTINCT_RANGE_ROWS are the number of distinct values between two range hi keys.

SELECT COUNT(*) [count]
FROM
(
FROM dbo.AWStatTest
WHERE AddressLine1 < '1, place Beaubernard'
AND AddressLine1 > '#500-75 O''Connor Street'
) AA

If you recall from the previous lesson density is calculated as 1/Unique number of values. So we can calculate the density of the range between two RANGE_HI_KEYS. The density of the range between ‘1, place Beaubernard’ and ‘#500-75 O’Connor Street’ is 1/13 or .0769231.

Density times the number of rows gives us the average number of times each value will exist in this range. .0769231 X 14147 = 1088.231 which is the value of the AVG_RANGE_ROWS column for the RANGE_HI_KEY ‘1, place Beaubernard’. If we query AddressLine1 for a value that exists in this range the number of rows estimated to be returned should be 1088.231.

SELECT * FROM dbo.AWStatTest where AddressLine1 = '081, boulevard du Montparnasse'

This density estimation between the RANGE_HI_KEYs works just like the density estimation for the entire column but with more precision. It can be more precise because the range is smaller and because as the histogram is built SQL Server can choose range hi keys so that the steps between them have a more even distribution.

These are examples of single columns statistics being used with queries against individual columns. What happens if the query against the table contains two predicates. That is the focus of the next post.

Statistics (Part 2) – I am your density

In order to show how statistics work I need to first create some data to work with. This table and the one created in the first post of this series will be used in this example.

CREATE TABLE AWStatTest
(
ID int identity(1,1),
City nvarchar(30),
PostalCode nvarchar(15)
)

Using the following command with some variability of the TOP clause and the number of execution specified on the GO command I created a table with some data to test against.

GO 1000

The content of the table looks something like this. It has 4040008 rows.

I then created this index on it (which also creates statistics of the same name) and look at the stats using DBCC SHOW_STATISTICS.

GO
DBCC SHOW_STATISTICS('dbo.AWStatTest','IX_AWStatTest')
GO

The properties for the statistics IX_AWStatTest looks like this.

The information displayed by DBCC SHOW_STATISTICS is divided into three sections: header, density vector and histogram. There is useful information in the header but the density vector and histogram are the most interesting stuff when it comes to understanding how cardinality estimates are done.

In order to understand the density vector you have to understand what density is. Density is a measurement of how selective an index is. It can be calculated by the formula: 1/(Number of Unique Values in the set). In our case the set is the column (or columns) upon which the statistics are built.

If ‘A’ is in the column 10 times, ‘B’ is in the column 20 times and ‘C’ is in the column 1000 times the number of unique values is 3, ‘A’, ‘B’ and ‘C’.

To get the number of unique values you can use this query:

SELECT COUNT(*)
FROM (SELECT DISTINCT addressline1 FROM dbo.AWStatTest) AA

In our case there are 2815 unique values for AddressLine1. 1/2815 = 0.0003552398. Now take a gander at the All Density value for the column AddressLine1 in the 2nd section of the statistics example. It is .0003552398. This is how the density value in the second section is calculated.

Take the density and multiply it by the number or rows in the table .0003552398 * 4040008 = 1435.1715. This means that if there were an even distribution of each of the unique values there would be roughly 1435 rows each. We can check this by multipying 1435.1715 by 2815 (the number of unique values) which comes out to 4040007.7725.

It can also be calculated by the formula (Average Number of Duplicates/Number of Rows).

In this example the average number of duplicates is going to be the (number of rows)/(number of distinct values). The average number of duplicates is 4040008/2815 = 1435.1715 (look familiar?). Divide this by the number of rows and you get the same density value as the 1st formula. We can actually derive the first formula from the second.

Knowing that the two formulas reduce to the same thing I will use 1/2815 for simplicity.

This is for a single column. How about the density on multiple columns? When we look at the combination of columns the density is based on the uniqueness of those two columns.

SELECT COUNT(*) [ALL DENSITY]
FROM
(
) AA

This query tells us there are 2833 unique values for the combination of AddressLine1 and AddressLine2. 1/2833 = 0.0003529827, which matches the All density value in the second section for AddressLine1 and AddressLine2.

So what can these density values do for us? They can help us estimate how many times a specific value (or combination of values) exists in a set of data by multiplying the density by the number of rows in the set (table).

This example shows the optimizer using the Density value from the density vector. I’m specifically using the variable here in the predicate to cause SQL Server to use the density vector. I will explain in a future post what makes this happen.

A snapshot of the estimated number of rows to be returned.

1435.17. Does that number look familiar? It is the density of AddressLine1 .0003552398, multiplied by the number of rows in the table, 4040008.

If we selected on AddressLine1 and AddressLine2 the density used would have been 0.0003529827.

When SQL Server creates statistics on multiple columns it creates combinations of density vectors in the order the columns are specified in the CREATE STATISTICS command (or in our case the CREATE INDEX command because statistics are created as a by-product of index creation). For the statistics associated with the IX_AWStatTest index the columns are AddressLine1, AddressLine2, City and PostalCode. The four density values computed are

Recall that density is a factor of the number of unique values and if the density is computed on multiple columns it is the number of unique values of that combination of values.

In these statistics the number of unique values of AddressLine1 are computed, the number of unique values of the combination of AddressLine1 and AddressLine2, the number of unique values of the combination of AddressLine1, AddressLine2, and City, and finally the number of unique values of the combination of AddressLine1, AddressLine2, City, and Postal Code.

The important thing to note is that just because two columns are in the same statistic doesn’t mean the density can be used. In this example the columns AddressLine1 and PostalCode are both represented in forth density value but this line is not the combination of AddressLine1 and PostalCode only. That combination doesn’t exist in these statistics. If you need that combination you would have to create another statistics object.

Density if a great tool for calculating cardinality estimates and is used often by the query optimizer but there is a problem. This is a good graphical representation of what the problem is.

Each of these columns has the same density even though the different values (red and green) are only distributed evenly (i.e. the same number of each not the fact they are ordered red to green) in one of the columns. Estimates caluculated based on density be grossly inaccurate if the data distribution is skewed. In this case there skew isn’t significant because there are only ten rows. How about the case where there are 100,000 rows and only one is read. The estimate would be 50,000, wildly over for one and wildly under for the other.

To get much, much better estimates SQL Server uses histograms which will be the topic of the next post.

Statistics (Part 1) – What would you say you do here?

This is the first post in a series of posts I will be doing about SQL Server statistics.

As I got more serious about learning SQL Server one of the things I had a hard time wrapping my mind around was statistics. In the abstract I could understand what they were for but beyond that I didn’t understand how they worked. Now that I’ve learned a lot about them I figure others could use some simple instructions on what they are. Hopefully this is simple enough.

Stats are the unsung hero of query optimization. With good stats you can get a well optimized query. With bad stats…not so good. I’ve seen 100x performance gains from implementing good statistics.

Statistics help the query optimizer determine the best execution plan to use to fulfill a query by predicting the number of rows that will be returned from operations. Knowing the number of rows that will be returned by different operations the optimizer can choose the combinations of operators that will best (lowest cost – quickest, less I/O, less CPU) fulfill the query.

The best way to explain this is with an example. I’m going to create a table with three columns, an IDENTITY column, a char(1) column and a varchar(10) column. There will be 10,000 rows. In 9999 of them the character column will be the value “A” while 1 of them is the value “B”. The varchar(10) column is some throw away data. I will put an index on the character column, Col1. The index will automatically create statistics. Then I can run a couple of queries against Col1 to see what types of query plans are generated.

CREATE TABLE dbo.StatTest
(ID int IDENTITY(1,1) NOT NULL,Col1 char(1),Col2 varchar(10))

CREATE CLUSTERED INDEX IC_StatTest ON dbo.StatTest(ID)

CREATE INDEX IX_StatTest ON dbo.StatTest(Col1)
GO

INSERT INTO dbo.StatTest (Col1,Col2) VALUES ('A','ABCDEFGHIJ')
GO 9999

INSERT INTO dbo.StatTest (Col1,Col2) VALUES ('B','ABCDEFGHIJ')
GO

The first test will be this query with the predicate searching for records where Col1 matches “A”.

SELECT * FROM dbo.StatTest WHERE Col1 = 'A'

This is the query plan produced. It is a clustered index scan.

Looking at the properties of the clustered index scan operator it shows and estimate of 9999 rows to be returned.

However with this execution plan of this query using a different predicate results in a different query plan.

SELECT * FROM dbo.StatTest WHERE Col1 = 'B'

This plan uses an index seek/bookmark lookup combination. The optimizer estimated that 1 row would be returned for this value. In the previous query an index scan was the optimal way to retrieve all of the rows where Col1 = A. In the second query and index seek and book mark lookup are the optimal.

It was the statistics that enabled the optimizer to craft a query plan that was optimal for each data condition. Without the statistics the optimizer wouldn’t have had any idea how many rows would be returned or what the best method of accessing the data was.

The consequences of using the wrong plan are fairly noticeable. Imagine if the query optimizer had no way of knowing there was only one value of ‘B’ in Col1 and chose to do a table scan instead of a bookmark lookup.

By using SET STATISTICS IO ON we can see how may logical reads were required to fulfill this query when using a bookmark lookup.

Now forcing it to use a table scan instead.

The table scan required 41 physical reads (physical reads and read-ahead reads) while the bookmark lookup required only 3. The number of physical reads to perform the scan is indefinite and related to the size of the table. The number of physical reads to perform the bookmark lookup may increase slightly as the size of the index grows (and the B-tree grows deeper) but nothing like the number a table scan will need. Why use a table scan to find a single record when a key lookup is faster? Likewise why use key lookups to lookup up multiple records when a table scan is faster? It is statistics that helps the optmizer to determine the best method to access the data.

In the next post I look at the statistics object and start to examine how the optimizer uses it to get cardinality estimates.

SQLskills Immersion Training – Day 5, Benchmarks, Baselines and Using Extended Events

Last day.  Very tired. Hard to pay attention because of the lack of sleep and muscle atrophy from sitting all day for a week.  SQLskills needs masseuse on staff. Go here. Read what is in modules 11 and 12.  That’s what we did.  It was good.

Update 8/23/2012 – Here are links to each day:
Day 0
Day 1
Day 2
Day 3
Day 4
Day 5

SQLskills Immersion Training – Day 4, Index Analysis and Memory Performance Tuning

An unlimited supply of bacon is a dangerous thing.  I open up the lid on a steel container and bacon is there.  I come back 10 minutes later and the remaining bacon has reproduced. Pork is the paterfamilias of the non-vegetable proteins and bacon that replicates is the consul of pig products.

This was the best day of training for me so far.  SQLOS and memory usage is one of my favorite areas of study and Jonathan’s coverage of the topic was excellent.  I have a pretty good framework to work from having spent a lot of time studying it and the information I got today filled in some holes in my knowledge (and opened some new ones).  On Tuesday I worried that the shotgun approach to the material wasn’t good enough without some support information on how to use that material. The supporting info is what we got today.

I do feel sorry though for anyone who hasn’t taken the time to get a foundation in SQLOS and SQL Server memory management.  That was a lot of information and I don’t know how people absorbed it.  That is one of the problems when talking about a topic that is so broad where to understand A you have to understand B but to understand B you need to also understand A.  You have to start somewhere.

Kimberly completed the module on plan cache analysis and index usage.  It was good but I was distracted by the bacon supply.

Update 8/23/2012 – Here are links to each day:
Day 0
Day 1
Day 2
Day 3
Day 4
Day 5

SQLskills Immersion Training – Day 3, Query Plan Analysis, CPU Performance Tuning, Plan Cache and Index Analysis

This was my favorite day of training so far.  Most of the topics I am already very familiar with.  This isn’t a waste of time though. My learning style works by constructing a large framework of basic information then filling in that framework with more detailed and esoteric info.  I already had a very solid base in these areas and the training today helped fill in some of the blanks.

Yesterday Joe began a module teaching about query plan analysis. He continued that today teaching about the more common operators, the different join types and when they are best used and which operators may indicate you are having performance problems.  He went into depth on hash algorithms, MAX DOP and parallelism, cardinality estimation problems, and much more.  The last slide of the module was his “watch” list of things he looks for when analyzing query plans.  I’m happy to say that a lot of what I looked for is already on the list.  I’m also happy to say that I learned a few more things to look for that I hadn’t considered before or had forgotten.

Getting repetition and reinforcement of things already learned is another reason these classes are good.  Most of what is being taught, 95% at least, I’ve been exposed to before either though training classes, videos or whitepapers. But a lot of it doesn’t get absorbed the first or second time.  The repetition helps to solidify that in memory while getting some new tidbits to add to it.

Joe taught about how to troubleshoot CPU performance and did an introduction to resource governor.  This can tie into the previous module because in many cases the causes of high CPU utilization can be discovered from query plan analysis.

To end the day Kimberly began teaching a module on plan cache and index analysis.  This is a good follow up on the same material from IE1.  I took that class last year and don’t remember exactly what was taught but there appears to be lot of repeat information from IE1. It is hard for me to say what is different since it has been a year.  The info is still very good though. The repetition is obviously good since I have a ton of notes.

Like the previous modules I discovered I was already doing a lot of things they recommend but added to my knowledge and a few new tricks.  That statement appears to be the broken record of the blog this week.

Update 8/23/2012 – Here are links to each day:
Day 0
Day 1
Day 2
Day 3
Day 4
Day 5

SQLskills Immersion Training – Day 2, Waits and Queues, Extended Events and Query Plan Analysis

Jonathan finished his introduction to SQLOS with a fire hose introduction to memory management.  This is a topic I’ve been studying for the past two years. It is not easy and there is a lot to learn.  Using various sources from the interweb I’ve developed a methodology for investigating memory pressures (internal, external, VAS) and was hoping to get some more ammunition. I was initially disappointed because this seemed to simply be a collection disconnected facts and lists of DMVs but with no information about how to detect memory pressure or troubleshoot memory problems.

However, if I peek ahead in the book I see that there is an additional section on troubleshooting memory pressures, specifically internal, external and VAS.  Excellent! This is what I would have expected to see.

Paul taught about the Waits and Queues methodology of performance tuning. This included an introduction to thread scheduling in SQLOS, latches and spinlocks.  He gave some very detailed information about the four most common waits and what those really mean.  I’ve been using the Waits and Queues methodology for a couple of years now so this wasn’t new to me but there was a lot of extra information to help round out how to use this methodology better and how to not chase red herrings.

Following the Waits and Queues section was a section about Extended Events.  In short, it was a good intro.  Given how broad this topic is there is only so much one can get in a single 2 hour session.

Having some pretty good knowledge of SQLOS memory management and Waits and Queues but almost no knowledge of Extended Events I was able to gauge how well the material targets a class with a broad range of experience.  The Waits and Queues module gave me a good bit of info to add to what I already knew but also appeared to be a decent introduction for those who don’t already use the methodology.  The Extended Events module gave me a good introduction to the topic but likely had some advanced information for those in the class who have been using them for a while.  The ability to target this broad range of experience in a class is a sign of a well-developed curriculum, I believe.

The last module of the day was the first this week taught by Joe, Query Plan Analysis.  Once again I’m pleased to see that the methodology I use in analyzing query plans is much like the one Joe presented. We ran out of class time and this topic will continue on Day 3 where Joe will get into some of the more common operators.

For bonus content Glenn Berry gave an hour long presentation about how to choose the right SQL Server hardware. He talked specifically about which processors to choose, and possibly more important, which processors not to choose.

It was a long day but packed with excellent information.

Update 8/23/2012 – Here are links to each day:
Day 0
Day 1
Day 2
Day 3
Day 4
Day 5

SQLskills Immersion Training – Day 1, SQL I/O and SANs

Quick notes:

• The food is excellent and more than you can possibly want.
• The entire crew of SQLskills is here even though only Paul, Kimberly, Jonathan and Joe are teaching.
• The group knows how to have fun.  I wouldn’t have paid to come here just to watch people have fun. I can go to the comedy club for that.  But if I can get quality instruction with or without fun I’ll choose fun.
• Paul offered that once you take a class with SQLskills you have lifetime rights to email him about a problem.  Having taking a class with SQLskills I can attest that this is true.  I emailed Paul and Jonathan within the last year about two different problems and both were willing and able to help me.  Naturally, there will be a limit on how much they can do for free but they definitely pointed me in the right direction.
• The dedication to your education is amazing.  They emphasized that if you have any question then you should definitely ask it.  And if the person you ask doesn’t know it then they will point you to someone on the crew who does.  Once again I have personal knowledge that this is true.  Last year I asked Paul a question about SQL Server memory.  He directed me to email Jonathan and within an hour Jonathan had replied with more information than I expected and 100% answered my question.

Here is the list of topics covered in this week’s training.  We covered modules 1, 2 and 3.

Last year in IE1 Paul did a brief lesson about SQL I/O.  I wanted more info about that and was disappointed I didn’t get it.  By mid-morning today I cried ‘uncle’.  I didn’t get all of the I/O information I wanted but I got more than I could absorb and links to whitepapers to read after my brain cache has been flushed.

Then Jonathan took over to talk about SSDs, RAID, SANS, storage, and related topics.  When I heard that it wasn’t just Paul and Kimberly teaching this year I was initially concerned that I wouldn’t be getting the best instruction.  That concern was misplaced.  Jonathan clearly had the knowledge to teach these topic and the impact of each on SQL Server.  I know now why they hired him.  Quote of the day from Jonathan about SQL Server performance on SANs, “SAN vendors are getting better about how they lie to you.”  It was funny. Maybe you had to be there. Maybe were all a bunch of geeks.

If you are already a SAN guy this would likely be basic info. If not this is a great introduction. An added benefit is getting to hear a bunch of SAN guys talking about SANs.  The class members already experienced with SANs engage Jonathan with questions and discussions. This adds to the education because it puts the information Jonathan is displaying on the screen into a real life scenario.  Very useful.

All-in-all a very good day.

Update 8/23/2012 – Here are links to each day:
Day 0
Day 1
Day 2
Day 3
Day 4
Day 5

SQLskills Immersion Training – Day 0, Getting There

The flight to the west coast was a bit of a drag. They don’t design airplanes for short people.  Since I’m tall it’s even worse.  I’m not afraid to fly but the thought of my body’s dissemination into the atmosphere is, well, disturbing.  Fortunately the captain had a Sullenbergeresque mustache and I was much comforted.

The first leg was fine.  On the second leg the guy in front of me immediately reclined his seat and hunched over his tray table to watch a movie.  I didn’t say a word, naturally.

Seated beside me, for four hours, was an expert on whole body health. I learned how the red dye in my M&Ms was slowly poisoning me, that milk was slowly poisoning me, that my Coca-Cola was, you guessed it, slowly poisoning me.  She whipped out her iPad and watched a new age documentary.  Then Avatar.  When she stretched and yawned it was apparent she wasn’t wearing deodorant.  I didn’t get a lecture about how the chemicals in my deodorant were killing me.  Small mercies and all that.

After landing at Sea-Tac she asked me if the flight was long for me.  Four hours squished into a space that isn’t designed to fit someone 4 inches shorter than me while being lectured about the lethality of candy corn, wishing I had bought that pair of swimmer’s nose clips yesterday.  “Yes”.

“That is why it was long. Life is what you make of it”.

I wish I were kidding. I looked behind me to see if I was on the Tony Robbins version of Candid Camera.

The other person in the row was quiet the entire time.  So, to bring this post back to SQL Server, my stats tell me that 50% of the people I meet are going to tell me that the leather in my shoes is going to make my feet fall off while the other half will ignore me.

Maybe I should increase my sample size.

Update 8/23/2012 – Here are links to each day:
Day 0
Day 1
Day 2
Day 3
Day 4
Day 5

Immersion training with SQLskills

So, next week I go to Seattle for a week of training with Paul Randal, Kimberly Tripp, Jonathan Kehayias and Joe Sack. This will be my second training class with SQLskills. Last year I attended their IE1 class and it was some of the best training I’ve had. I highly recommend their immersion classes to any SQL Server professional.

Admittedly, I haven’t been looking forward to it even though I expect it will be outstanding. First, I’m tall and flying absolutely sucks.

Here is the legroom I didn’t have on my last flight. Notice my right knee actually is longer than the amount of leg room available.

This is before the considerate person in front of me decided she needed a nap.

Second, they are three hours behind me and I get up early anyway. On Seattle time I will be wide awake at 3AM. Training doesn’t start for another 5 hours about the time I’m ready for lunch. At least breakfast will be good. It was last year anyway.

At any rate, this is one of those posts I’m putting up to shame myself into blogging every day about the training. I’m sure I will be tired at the end of each day (the classes are intense and densely packed with information.) Your brain will hurt at the end of the day.

So, assuming my plane doesn’t auger in or that I don’t get diverted into a romantic get-away from an overly tactilian (if that isn’t a word it should be) TSA agent I’ll be putting up some posts about my training experiences next week.

Update 8/23/2012 – Here are links to each day:
Day 0
Day 1
Day 2
Day 3
Day 4
Day 5