Manage Your Transaction Log VLF Count

There have been numerous articles written about the problem of too many or too few VLFs in your transaction log. So many that I’m not going to rehash the problem in this post. Go read these links if you haven’t heard of this issue.

Too Many VLFs @ MSDN.
Transaction Log VLFs – too many or too few @ SQLskills
High Virtual Log File (VLF) Count @ Brent Ozar Unlimited

The VLF Count/Growth Problem

Resizing the log to contain the “right” number of VLFs isn’t hard but it does require a bit of math. The problem is that the number of VLFs created when you grow a log file isn’t linear. Any growth between 1MB and 64MB will contain 4 VLFs. Growths over 64MB up to 1GB will contain 8 VLFs. And any growth greater than 1GB will contain 16 VLFs. Imagine you want your log file to be 5000MB. If you do a single growth of 5000MB you will get 16 VLFs. If you grow it in 100 growths of 50MB each you get 400 VLFs. It is necessary to do intermediate growths to get both the file size you want and the number of VLFs you want. The difficulty is that intermediate growths can give 4, 8 or 16 VLFs per growth depending on the size of each intermediate growth.

The logic to do this isn’t hard but it is iterative and a bit time consuming. I was faced with having to resize several hundred log files and decided that I didn’t want to do the iterative logic required for each one. So, I wrote a script to do it for me.

Choosing How to Grow to Get The VLFs You Want

Categorically we can think of there being three different growth algorithms, and we need to figure out which one to use for our given log size and VLF requirements. So the script needs a little logic to figure out the number of growths and the size of each. This little example explains the logic of the script when we want a 1500MB log file with 80 VLFs.

If I want 80 VLFs I can divide this by the number of VLFs created with each of the different growth algorithms.

80/4 = 20 file growths
80/8 = 10 file growths
80/16 = 5 file growths

Take the total size and divide by the number of growths required for each growth algorithm to get the number of MB per growth.
1500/20 = 75 MB/growth (for 20 file growths)
1500/10 = 150 MB/growth (for 10 file growths)
1500/5 = 300 MB/growth (for 5 file growths)

Now we can simply pick the growth algorithm which lines up with reality.

Twenty file growths at 75MB each would create 160 VLFs.
Ten growths at 150MB would create 80 VLFs.
Five growths at 300MB would create 40 VLFs.

So we need to do ten growths at 150MB each to get the requisite 80 VLFs.

The link to Brent Ozar Unlimited recommends shrinking the log file and regrowing it. When you shrink the log you will still be left with a few small VLFs. Those first few VLFs will never go away. When you use the script to grow the log file you will always end up with a few more than the number you specify. This isn’t a problem. The best number of VLFs isn’t an exact science; you just want to be in the ballpark of what is reasonable. The script doesn’t shrink the log first though. You have to do that on your own.
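
If you want to do that shrink step before running the script, something along these lines is a reasonable starting point. This is only a sketch: the logical log file name (TestDB_log here) is an assumption, and you may need a log backup or checkpoint first so the active portion of the log is not sitting at the end of the file.

USE TestDB
--placeholder logical file name; check sys.master_files for the real one
--target of 1MB; the actual result depends on where the active log sits
DBCC SHRINKFILE ('TestDB_log', 1)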

I hope you find this script useful. I tested it on SQL 2012 and have used it on SQL2008R2 with excellent results. YMMV.


--© 2014 Dale Burnett
--http://daleburnett.com/2014/02/manage-your-transaction-log-vlf-count/
--Permission is granted to use this script subject to the following
--conditions:
--The above copyright notice and this permission notice shall be
--included in all copies or substantial portions of the script.


-- growing the log file 
-- <= 64MB - 4 VLFs
-- > 64MB to <= 1024MB - 8 VLFs
-- > 1024MB - 16 VLFs

DECLARE @VLFs int --use a value divisible by 16
DECLARE @LogGrowSize float --in MB
DECLARE @MBVLF float
DECLARE @GrowthFactor tinyint
DECLARE @Grow4Size float
DECLARE @Grow8Size float
DECLARE @Grow16Size float
DECLARE @Grow4Growths int
DECLARE @Grow8Growths int
DECLARE @Grow16Growths int
DECLARE @GrowSizeToUse int
DECLARE @NumberOfGrowths int
DECLARE @DBName sysname
DECLARE @LogFileName sysname
DECLARE @CurrentSizeMB bigint
DECLARE @NewSize bigint
DECLARE @Exec bit
DECLARE @Msg varchar(50)

--edit these control variables to your liking
--don't forget to change @Exec to 1 to 
--actually cause the file growth to occur. 
SET @VLFs = 96 --the number of VLFs. Use a number divisible by 16
SET @LogGrowSize = 1000 --Size in MB to grow 
SET @DBName = 'TestDB'  --The database whose log you want to grow
SET @Exec = 0  --set to 0 to see what the growths will be.

--growths below 1MB create 1, 2 or 3 VLFs depending on the size.  
--I didn't handle these conditions
IF @LogGrowSize < 1
BEGIN
    RAISERROR('This script does not work for growths less than 1MB',16,1)
    GOTO TheEnd
END

--make sure there is only one log file
IF (SELECT COUNT(*) FROM sys.master_files 
    WHERE database_id = DB_ID(@DBName) and type = 1) > 1
BEGIN
    RAISERROR('This script works on databases with only one log file',16,1)
    GOTO TheEnd
END

--Get the logical name of the log file to use later
SELECT @LogFileName = name 
FROM sys.master_files 
WHERE database_id = DB_ID(@DBName) and type = 1


IF @VLFs % 16 <> 0 
BEGIN
    RAISERROR('Desired VLF count must be divisible by 4, 8 and 16',16,1)
    GOTO TheEnd
END


--Get the number of growths needed for each growth 
--pattern (4, 8 or 16 created for each growth)
SET @Grow4Growths = @VLFs/4
SET @Grow8Growths = @VLFs/8
SET @Grow16Growths = @VLFs/16


--Now get the growth size required for each algorithm
SET @Grow4Size = @LogGrowSize/CAST(@Grow4Growths as float)
SET @Grow8Size = @LogGrowSize/CAST(@Grow8Growths as float)
SET @Grow16Size = @LogGrowSize/CAST(@Grow16Growths as float)

--decide which growth pattern will get us the desired log size and VLFs.
IF @Grow4Size <= 64.0
BEGIN
    --can't use decimals in the SIZE clause of ALTER DATABASE and 
    --since it breaks on =64 CEILING doesn't risk putting us 
    --into a higher VLF growth rate
    SET @GrowSizeToUse = CEILING(@Grow4Size) 
    SET @NumberOfGrowths = @Grow4Growths
    SET @GrowthFactor = 4
END
IF @Grow8Size > 64 AND @Grow8Size <= 1024
BEGIN
    --can't use decimals in the SIZE clause of ALTER DATABASE and since 
    --it breaks on =1024 CEILING doesn't risk putting us into a higher 
    --VLF growth rate. If we had a value of 64.1 there is the danger of 
    --rounding down to 64 which would drop us into a
    --lower VLF growth rate so using CEILING instead of ROUND.
    SET @GrowSizeToUse = CEILING(@Grow8Size) 
    SET @NumberOfGrowths = @Grow8Growths
    SET @GrowthFactor = 8
END
IF @Grow16Size > 1024 
BEGIN
    --See above notes about the use of CEILING
    SET @GrowSizeToUse = CEILING(@Grow16Size)
    SET @NumberOfGrowths = @Grow16Growths
    SET @GrowthFactor = 16
END

--see what the growth size and number of growths will be for each algorithm
SELECT @Grow4Size [Grow4Size] ,@Grow4Growths [Grow4Growths], 
       @Grow8Size [Grow8Size], @Grow8Growths [Grow8Growths], 
       @Grow16Size [Grow16Size], @Grow16Growths [Grow16Growths]
--The parameters the script has chosen to use
SELECT @GrowthFactor [GrowthFactorToUse], @GrowSizeToUse [GrowSizeToUse_MB], 
       @NumberOfGrowths [NumGrowths], @VLFs [VLFs], 
       CAST(@GrowSizeToUse as float)/@GrowthFactor [VLFSize_MB]


--grow the log in the necessary increments
DECLARE @i int = 0
DECLARE @stmt varchar(max)
WHILE @i < @NumberOfGrowths
BEGIN
    --We have to take the current size and add the growth size to get a new 
    --size to pass into ALTER DATABASE. The current size may not be 
    --divisible on 1MB boundaries.  Ex. A size of 150 pages would be 1.17MB. 
    --We need to round up or down to a MB boundary so the current size can 
    --be added to the growth size to get the new size.  I chose to round down. 
    --The conversion of this to a bigint should round this down to the 
    --lowest MB boundary.
    --Regardless of rounding up or down because of this the VLFs in the first 
    --growth may not be equal size to the other new VLFs.
    --It should, however, be close enough.  This doesn't have to be 
    --an exact science.
    SELECT @CurrentSizeMB = (size * 8/1024) 
    FROM sys.master_files 
    WHERE database_id = DB_ID(@DBName) and type = 1
    
    SET @NewSize = @CurrentSizeMB + @GrowSizeToUse
    SET @stmt = N'ALTER DATABASE ' + @DBName + N' MODIFY FILE (NAME = ''' + 
    @LogFileName + N''', SIZE = ' + CAST(@NewSize AS nvarchar(20)) + N' MB)'
    IF (@Exec = 1) EXEC(@stmt)
    SET @i = @i + 1
    SET @Msg = 'Log growth ' + cast(@i as varchar(10)) + ' is complete.'
    RAISERROR (@Msg, 0, 1) WITH NOWAIT
END



IF @Exec = 1
BEGIN
    SELECT size [size_in_pages],size * 8/1024 [size_in_mb] ,* 
    FROM sys.master_files 
    WHERE database_id = DB_ID(@DBName) and type = 1

    DBCC LOGINFO
END

--where we end up if there is a user-defined error
TheEnd:

Everything You Ever Wanted to Know About SET NOCOUNT

SET NOCOUNT has to be one of the most used T-SQL statements. I can’t think of a single production script or procedure in which I didn’t use this SET statement. And I have to stretch to come up with a case when I would specifically not use it. The only time I don’t use NOCOUNT is when writing off-the-cuff test scripts. I decided to dig into it a bit and see what it really does behind the scenes.
From BOL:
“Stops the message that shows the count of the number of rows affected by a Transact-SQL statement or stored procedure from being returned as part of the result set.

SET NOCOUNT ON prevents the sending of DONE_IN_PROC messages to the client for each statement in a stored procedure. For stored procedures that contain several statements that do not return much actual data, or for procedures that contain Transact-SQL loops, setting SET NOCOUNT to ON can provide a significant performance boost, because network traffic is greatly reduced.”

I learned that the first statement isn’t as clear as it could be. I think it is technically true but easy to misunderstand. It appears to say that it stops the count from being returned in a result set. This is not true. The response still returns the count for statements that return data to the client. However, this count is marked as not to be used.
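
A quick way to see the visible part of this for yourself (a small sketch using a catalog view so it runs anywhere):

SET NOCOUNT OFF
SELECT TOP (5) name FROM sys.objects    --a rows affected message appears in the Messages tab

SET NOCOUNT ON
SELECT TOP (5) name FROM sys.objects    --no message, but the same result set comes back
SELECT @@ROWCOUNT AS RowsFromLastSelect --still 5; the server still tracks the count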

The second paragraph is a bit misleading. It does stop the DONE_IN_PROC messages (tokens, actually, which I will explain later) from being sent. The reduction in network traffic is very case specific. It is always better to have less network traffic than more, but I can think of only a few cases where a significant difference would be noticed.

Examining this requires a bit of understanding of Tabular Data Streams (TDS). I’m not interested in writing a series of columns about TDS at the moment so I’m going to keep this intro as short as possible. One of the great things about TDS is that we don’t have to know a lot about it ’cause it just works.

REPLICATE

While creating some CLR user defined functions I discovered that data passed into my nvarchar(max) parameter was getting truncated if it was more than 8000 bytes.

For about a day and a half I suspected that the C# code was not set up to properly handle LOBs and that it was getting truncated. I tried every solution I found online and nothing helped.

I was creating the LOBs using the REPLICATE function. My last resort was to make sure I really was passing in more than 8000 bytes, which led me to read the documentation for REPLICATE.

It says:

If string_expression is not of type varchar(max) or nvarchar(max), REPLICATE truncates the return value at 8,000 bytes. To return values greater than 8,000 bytes, string_expression must be explicitly cast to the appropriate large-value data type.

There was my problem.

Test code:

DECLARE @string nvarchar(max)
DECLARE @result nvarchar(max)

SET @string = N'A'

SET @result = REPLICATE(N'A',4001)
SELECT LEN(@result)

SET @result = REPLICATE(CAST(N'A' as nvarchar(max)),4001)
SELECT LEN(@result)

SET @result = REPLICATE(@string,4001)
SELECT LEN(@result)

Results: the first LEN comes back as 4000 (the REPLICATE output was truncated at 8,000 bytes) while the second and third come back as 4001.

This is the link to the request for assistance on SQL Server Central if you are interested in the details of the LOB parameters to CLR functions.


What is SQL Server?

When asked what I do for a living my answer befuddles many. Outside of the information technology realms most people have not heard of SQL Server. If they have, they don’t know what it is or what it does. “I’m a SQL Server developer” causes many to cock their head sideways like a dog presented with a new noise. I used to give a vague answer that I work in computers or information technology but I decided I wanted to be more specific. You never know when the next guy you talk to is going to introduce you to a great opportunity. And I admit that I sometimes enjoy the befuddled looks. I don’t have a dog right now.

An amusing exercise is to try to get the answer to this from Microsoft’s web site. You might think that if you Google or Bing the question, “What is SQL Server” the top result would be a link to Microsoft with a fantastic explanation of what it is. You might be wrong. It is the fourth result on Google and the third result on Bing.

If a prize existed for the most obfuscated way to describe your products Microsoft would be a top contender. By visiting the SQL Server site at Microsoft I learn that SQL Server will “unlock breakthrough insights across the organization” and that it will provide me with “data warehousing and analytical solutions backed by IT management and insights.” That’s two uses of insight in two paragraphs. Word offers this nifty feature for suggesting synonyms. That would have been a great feature to have for just about every paper I wrote in college. I suppose there is only a remote possibility that the people who wrote this marketing literature have ever heard of Word. At any rate, at least the word data does appear on the page…once. The product page is filled with a lot of “why” and not a lot of “what”.

The following is a compilation of many of the conversations I’ve had with people when the topic comes up.
What is SQL Server?

It is used to store data.

You mean my grocery list or phone numbers?

Well it can be used for that but…not exactly.

Does it store Sea Quills?

You know that when you fill out a form online that includes your name, address and other information the company you send that to has to put that information somewhere.

I never really thought about it.

Yeah, I used to tell people that the Submit button was actually a Print button in disguise that caused the form to print at company headquarters but too many people believed me.

I would never have believed that (he says with a sideways glance and face askew).

You figure that there are thousands if not tens of thousands of people who are filling out that form. Not only does the company have to store that information, they have to do so in a way that makes it easy to get back. If I wrote 10,000 names, addresses and other information about each customer into a Word document it would be very cumbersome to manage. SQL Server allows a company to store the information in a way that makes it easier to manage and easier to retrieve if they need to view it.

So it goes into SQL Server?

SQL Server is one of the many types of products that can be used to manage this data. It is made by Microsoft. There are other products by other companies.

So why SQL Server?

I guess it has to be because of the insight.

Statistics (Part 8) – Nulls

This is a continuation of my blog series about statistics. The previous post is here.

So the question is: what happens to your cardinality estimates when the column is nullable and contains many nulls (or is it more accurate to say doesn’t contain…regardless, I think you know what I mean)?

Under this condition the face value of the density in the density vector is not used; it is adjusted to account for the presence of nulls in the column.

In part 2 of this series I created a table called AWStatTest which contains a subset of the columns from the Person.Address table in the AdventureWorks database: AddressLine1, AddressLine2, City, and PostalCode.

The column AddressLine2 is nullable and, given the nature of addresses the second address line doesn’t always contain data. This is the column I will be using for my test.

The first step is to create a statistics object on this column, which I will do with this statement. I got into the habit of using full scans in my examples because it is often easier to understand how cardinality estimations work when the numbers predicted by the stats match the actual numbers in the table. This isn’t a blanket recommendation for full scans though.

CREATE STATISTICS Addressline2 on dbo.AWStatTest(AddressLine2) WITH FULLSCAN

And to see the stats it produced.

DBCC SHOW_STATISTICS ('dbo.AWStatTest','Addressline2')

Recall from part 2 of this series that density is calculated by the formula 1/(number of unique values). The number of unique values in the column can be found with this statement.

SELECT COUNT(*) FROM
(
SELECT DISTINCT AddressLine2 FROM AWStatTest
) A

In our case the value returned is 39. So, 1/39 = .02564103 which is the density shown in the density vector above.

In order to test cardinality estimation under these conditions I will need to use a predicate which causes the optimizer to use the density value for estimation instead of the histogram. Comparing against a variable instead of a static value will force this.

DECLARE @Addressline2 nvarchar(60)
SET @Addressline2 = '# 10'
SELECT * FROM dbo.AWStatTest WHERE AddressLine2 = @AddressLine2

Based on the knowledge gained in previous posts I would expect the cardinality estimate to be calculated from the density vector using the formula (Number of rows in the table) * (Density), or in our case 4040008 * .02564103, which would be 103590 rows.

But the estimate produced by the optimizer is quite different: 1275.79

My next guess would be that the optimizer is using a magic density. But that formula (4040008^.75) produces an estimate of 90112.84, still way off from the estimate the optimizer produced.

If the database option ANSI_NULLS is ON then a null comparison via an equality statement will never resolve to true. Even a comparison of NULL = NULL does not resolve to true. Given this, the only way our predicate AddressLine2 = @AddressLine2 can be true is if the value of @AddressLine2 is not null and the value of this column in the row is also not null. In other words, any row which has an AddressLine2 of null would never be returned by this statement, even if the value of @AddressLine2 is null.
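
A quick sketch of that comparison behavior, runnable on its own:

SET ANSI_NULLS ON
IF NULL = NULL PRINT 'equal' ELSE PRINT 'not equal'  --prints 'not equal'; the comparison is unknown

SET ANSI_NULLS OFF
IF NULL = NULL PRINT 'equal' ELSE PRINT 'not equal'  --prints 'equal'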

So, what if we recalculated density and cardinality but disregarded nulls? There would be only 38 unique values in the column instead of 39, so the density would be 1/38 or .02631578. The histogram shows there are 3991528 null values in AddressLine2. The total number of rows in the table minus the number of null entries gives us the number of non-null values in the column: 4040008 – 3991528 = 48480.

Using the new density and table row count to calculate cardinality would give us,
48480 * .02631578 = 1275.79, the same cardinality estimate produced by the optimizer.
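
Here is the same arithmetic as a small script, with the row counts from this post hard-coded, so it is only a sketch of the calculation rather than something to run against your own table:

DECLARE @TableRows float = 4040008
DECLARE @NullRows float = 3991528   --rows where AddressLine2 IS NULL, from the histogram
DECLARE @NonNullDistinct float = 38 --distinct non-null values of AddressLine2

SELECT (@TableRows - @NullRows) * (1.0 / @NonNullDistinct) AS AdjustedEstimate  --approximately 1275.79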

This is with ANSI_NULLS turned on. What if ANSI_NULLS is disabled? With ANSI_NULLS OFF a comparison of NULL = NULL would resolve to true and all of the nulls in the column could be returned when using the equality predicate construct.

SET ANSI_NULLS OFF

Running the same query again with ANSI_NULLS off produces a cardinality estimate of 103590 which is the estimate we came up with using the original density and full table row count.

Neat, huh?

Statistics (Part 7) – Magic Density

Sometimes the optimizer has absolutely no idea how to estimate cardinality. Either there are no statistics and auto-create statistics has been disabled, or the predicate isn’t of a form that lets SQL Server use any of the existing statistics. There was an example of the latter problem in my last blog post about stats on computed columns. On these occasions the optimizer must be like Spock.

SPOCK
In order to return us to the exact moment at which we left the 23rd Century, I have used our journey back through time as a referent, calculating the coefficient of elapsed time in relation to the acceleration curve.

BONES
Naturally.
So what is your problem?

SPOCK
Acceleration is no longer a constant.

BONES
Well, you’re gonna have to take your best shot.

SPOCK
Best shot?

BONES
Guess, Spock. Your best guess.

SPOCK
Guessing is not in my nature.

BONES
Well nobody’s perfect.

Unlike Mr. Spock guessing is definitely in the nature of the optimizer and when it has no other way to estimate cardinality it will simply guess. Well, it is always guessing but most of the time the guess is educated, informed by stats. Without information to go by the guesses are so unbelievably bad you’d think you’d stumbled into a screening of Highlander 2: The Quickening. There are some rules it has to go by but ultimately it is taking completely uneducated guesses. However I’ve found that knowing how it is going to guess has been helpful in discovering that the optimizer is guessing. To my knowledge there is no flag or other notification that it has taken a complete WAG. And knowing that it is guessing helps track down why we have crappy query plans and points us in a direction where we might be able to fix them.

For years I’ve read that if the optimizer doesn’t have statistics to go on it will estimate 10% of the rows in the table for straight equality searches.
Consider this query.

SELECT * FROM dbo.StatTest WHERE Col1 = 'A'

On a table with 12,000 rows this would be estimated at 1,200 rows returned. In preparation for this series of blog posts I tested everything I was asserting before putting it on the page. The only time I could come up with a 10% of row count cardinality estimate was with 10,000 rows on the table. I also could not figure out what percentage was being used. If I changed the table size the percentage changed. At that point I suspected that the cardinality estimate was logarithmic in nature and changed based on the size of the base table.

So I created many test tables, each with a different number of rows: 1000, 2000, 3000, 4000, 5000, 10000, 11000, etc. I plotted the number of rows estimated on a graph and, with the help of someone who knows what he is doing, was able to figure out the formula that fit the curve.

It turns out that instead of 10% the estimate is ROWS ^ .75. This explained why my 10000 row table gave me a 10% estimate (10000^.75 = 1000 = 10000 * .10). It was pure coincidence.

I then tested a query whose predicate queried against two columns.

SELECT * FROM dbo.StatTest WHERE Col1 = 'A' AND Col2 = 'B'

I expected it to use the same formula but it didn’t work. Plotting the numbers on a graph showed that the formula is ROWS ^ .6875.

I was about to give up when I ran across this blog post from Ian Jose which showed the factor constant used for cardinality estimates with a different number of equality matches in the predicate. I’m repeating the values here in case that page ever goes away.

1 column – Rows ^ (3/4) = Rows ^ .75
2 column – Rows ^ (11/16) = Rows ^ .6875
3 column – Rows ^ (43/64)
4 column – Rows ^ (171/256)
5 column – Rows ^ (170/256)

175 column – Rows ^ (0/256)

I tested this out to four column equality comparisons and decided that was good enough and I would believe Ian’s numbers.
For other situations the magic densities I’ve read about were all correct. BETWEEN uses 9% of row count and <, >, <=, >= use 30% of row count.
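
As a quick sanity check of those numbers, here is the arithmetic against a hypothetical 12,000-row table (the table size is just an example, not from any of my tests):

DECLARE @Rows float = 12000
SELECT POWER(@Rows, 0.75)   AS OneColumnEquality  --approximately 1147, not the 1200 a flat 10% would give
      ,POWER(@Rows, 0.6875) AS TwoColumnEquality  --approximately 637
      ,@Rows * 0.09         AS BetweenGuess       --1080
      ,@Rows * 0.30         AS InequalityGuess    --3600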

Ultimately, if the optimizer is using magic densities it is a good idea to figure that out and either create the stats or fix the code so that it can make educated guesses. Your instance will love you for it.

Statistics (Part 6) – Computed Columns

This topic is unbelievably dry. Dryer than ancient Egyptian pharaoh bone dust dry. Dryer than the surface of the sun dry. The layman might say “you’re talking about SQL Server and you expect it to be interesting?” Fair point. Maybe the topic of SQL Server is dry to the layman but almost every topic is interesting to someone. Some people collect thimbles or bugs. Some people juggle geese. And as interested in SQL Server as I am, this particular aspect of stats has got to be the driest I’ve ever written about. Death Valley dry. Dryer than Aunt Edna’s Thanksgiving turkey dry.

However I feel it is necessary to include this information as part of the series. Please pardon me if I try to spice it up a bit. I already shorted out one keyboard from the boredom induced drool. This builds on my previous and hopefully less boring series on stats.

So…stats on computed columns. You can do this. Even if the column isn’t persisted. There is a caveat though. The formula for the column must be deterministic.

What does it mean to be deterministic? Wikipedia defines determinism as “a metaphysical philosophy stating that for everything that happens there are conditions such that, given those conditions, nothing else could happen.” In computer science this means that given the same input you will always have the same output. In a deterministic universe if we went back 14 billion years and reran the big bang I’d still be typing this boring topic and you would have still stopped reading this two paragraphs ago (see what I did there?)

It would be impossible to generate statistics on a column that is created from a non-deterministic expression. Imagine a column that was based on the result of RAND(). Every select would produce a new value on each row. Stats would be wrong the minute they are created. So SQL Server won’t let you create stats on non-deterministic computed columns.

However given the predictability of determinism (this is actually the third time you’ve read this in the last 42 billion years you just don’t remember the first two times) SQL Server is easily able to create statistics on a computed column, even if the computed column isn’t persisted. If the computed column C is the sum of column A and column B then the computed column will always be A + B even if C is never saved on disk.
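
A minimal sketch of what that looks like, using a hypothetical table (the names here are mine, not from the series):

CREATE TABLE dbo.ComputedStatTest
(
    A int NOT NULL,
    B int NOT NULL,
    C AS (A + B)  --deterministic and not persisted
)

--allowed because the expression is deterministic, even though C is never written to disk
CREATE STATISTICS CStat ON dbo.ComputedStatTest(C)

--a column computed from a non-deterministic expression such as RAND() or GETDATE()
--would be rejected by CREATE STATISTICS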

According to the documentation the optimizer can make estimates even if the predicate contains the formula the computed column is calculated from instead of the column name. I will test that too. Fun.

Statistics (Part 5) – String Index

I’ve been talking about different aspects of stats and cardinality estimates. My last post is hiding here.

The String Index portion of a statistics object is sparsely documented. The existence of the string index in a stats object can be discovered by looking at the header output of DBCC SHOW_STATISTICS.


The documentation from Microsoft often refers to this as the String Summary.

The string index assists in cardinality estimates for substrings in predicates containing the LIKE clause with wildcard characters. It only does this for the leading column of a stats object and only for character data types, i.e. the string index is constructed when the leading column of a stats object is a character column.

The index contains substrings from the first and last 40 characters of a string. If the string is 80 characters long then the string index is created for all characters. If the string is 120 characters long the middle 40 characters would not be included in the string index.

To show this I created a test table with a column that is 120 characters long. This column can logically be divided into three sections: the first 40 characters, the middle 40 characters, and the last 40 characters. In each of these sections I create a substring that is unique to that section and does not exist in the other sections. By querying the table with LIKE and wildcards I can see if the cardinality estimates for the middle section are different from the estimates for the beginning and ending sections. In theory the cardinality estimate for the middle section should be the same as for a predicate whose substring is not part of the string index. I also include several thousand rows that have no notable substrings at all just to give the table some mass.

CREATE TABLE StringTest (ID int IDENTITY(1,1) PRIMARY KEY NOT NULL, Col1 varchar(120))
INSERT INTO StringTest (Col1) SELECT REPLICATE('A',120)
GO 1000
INSERT INTO StringTest (Col1)
    SELECT REPLICATE('A',10) +
           REPLICATE('B',10) +
           REPLICATE('A',100)
GO 30
INSERT INTO StringTest (Col1)
    SELECT REPLICATE('A',10) +
           REPLICATE('C',10) +
           REPLICATE('A',100)
GO 30
INSERT INTO StringTest (Col1)
    SELECT REPLICATE('A',120)
GO 1000
INSERT INTO StringTest (Col1)
    SELECT REPLICATE('A',50) +
           REPLICATE('D',10) +
           REPLICATE('A',60)
GO 30
INSERT INTO StringTest (Col1)
    SELECT REPLICATE('A',50) +
           REPLICATE('E',10) +
           REPLICATE('A',60)
GO 30
INSERT INTO StringTest (Col1)
    SELECT REPLICATE('A',120)
GO 1000
INSERT INTO StringTest (Col1)
    SELECT REPLICATE('A',90) +
           REPLICATE('F',10) +
           REPLICATE('A',20)
GO 30
INSERT INTO StringTest (Col1)
    SELECT REPLICATE('A',90) +
           REPLICATE('G',10) +
           REPLICATE('A',20)
GO 30
INSERT INTO StringTest (Col1)
    SELECT REPLICATE('A',120)
GO 1000

CREATE STATISTICS StringTestStat ON dbo.StringTest(Col1) WITH FULLSCAN
DBCC SHOW_STATISTICS('StringTest','StringTestStat')

In the first section there are 30 rows with the substrings ‘BBBBBBBBBB’ and ‘CCCCCCCCCC’, in the second section there are 30 rows with the substrings ‘DDDDDDDDDD’ and ‘EEEEEEEEEE’, and in the last section there are 30 rows with the substring ‘FFFFFFFFFF’ and ‘GGGGGGGGGG’. What we should see are cardinality estimates for the B string and C string similar to the cardinality estimates for the F string and the G string (giggle). But the cardinality estimates for the D and E strings should be similar to cardinality estimates for a substring that doesn’t exist.

Substrings in the first 40 characters:

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%BBBBBBBBBB%'

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%CCCCCCCCCC%'

Substring in the middle 40 characters:

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%DDDDDDDDDD%'

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%EEEEEEEEEE%'

Substrings in the last 40 characters:

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%FFFFFFFFFF%'

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%GGGGGGGGGG%'

Substring that doesn’t exist:

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%SSSSSSSSSS%'

As predicted, the estimates for the substrings in the first section, the B and C substrings, are similar to the estimates for the substrings in the last section, the F and G substrings. However, the estimates for the middle section, the D and E substrings, are similar to the estimates for substrings that do not exist in the column.

Given that there is so little documentation on this topic this is the best I can do to show how the string index can influence cardinality estimates.

My next post will be about statistics on computed columns.

Statistics (Part 4) – Multi-column cardinality

In the previous posts the predicates for the example queries have all had only one element, i.e. only one constraint against the table in the WHERE clause. While troubleshooting cardinality estimate problems it is helpful to understand what happens when there are two constraints against two different columns in the same table.

Update 2/22/2014:
Before reading this you might want to check out Paul White’s article on this topic where he gives a much better explanation than I did. He also includes the case of OR selectivity which didn’t even occur to me to investigate.

This is the sample query that will be used to show how cardinality estimates happen when there are two constraints.

SELECT * FROM dbo.AWStatTest where AddressLine1 = N'100, rue des Rosiers' and PostalCode = N'2010'

In Post 2 of this series I created a table called AWStatTest and filled it with data for demo purposes. So you don’t have to refer back to that post here is the DDL for that table and the index I created.

CREATE TABLE AWStatTest
(
   ID int identity(1,1),
   AddressLine1 nvarchar(60),
   AddressLine2 nvarchar(60),
   City nvarchar(30),
   PostalCode nvarchar(15)
)

CREATE INDEX IX_AWStatTest ON dbo.AWStatTest (AddressLine1,AddressLine2,City,PostalCode)

The index creation also includes a creation of statistics. This is a picture of the statistics for IX_AWStatTest and will be referenced in this post.

The predicate in the sample query is constraining the search by AddressLine1 and PostalCode. As shown in my post on density, the statistics IX_AWStatTest alone cannot be used for cardinality estimates for queries constrained by AddressLine1 and PostalCode.

There are at least two ways this can be estimated: 1) by using multiple stats or 2) by using a multi-column statistic.

Using Multiple Stats
One of the features of SQL Server is the ability to create single column statistics on the fly in order to help out with optimization (if the database is set to auto create them anyway…my database is set up to do so). When I ran the query above SQL Server created statistics for the column PostalCode. The name of those statistics is machine generated, _WA_Sys_00000005_5AEE82B9 (this will be named differently on your machine). You can see these stats with DBCC SHOW_STATISTICS.

DBCC SHOW_STATISTICS ('dbo.AWStatTest','_WA_Sys_00000005_5AEE82B9')


This statistics object has a density vector and histogram as well.

SQL Server can use the histogram for AddressLine1 from the statistic IX_AWStatTest and the histogram for PostalCode from the statistic _WA_Sys_00000005_5AEE82B9 to make a cardinality guess. The histogram for PostalCode shows that our constraint value ’2010′ is a RANGE_HI_KEY and therefore EQ_ROWS will be used to provide cardinality estimates for this value. The number of rows in the table estimated to match postal code 2010 is 13714.8. Looking at the histogram for AddressLine1 shows that ’100, rue des Rosiers’ is a RANGE_HI_KEY and EQ_ROWS would give us an estimate of 2020 rows if the predicate included a search for just this value.

So how are these used together? A quick example problem that should be easy for anyone familiar with playing cards helps us understand how two different histograms can be used to make a cardinality estimate. There are thirteen cards in a suit and four suits per deck of standard playing cards. What is the probability of drawing a face card that is a heart? In this case we are looking for cards that have two specific characteristics: a face card and a heart. There is a 13/52 chance of drawing a heart. There are 12 face cards (Jack, Queen, King) in a deck, so there is a 12/52 chance of drawing a face card.

The probability of drawing a face card that is also a heart is:
13/52 * 12/52 = .0576923.
If we were to treat this like a density and estimate the number of cards in the deck that would match our double criteria we would multiply .0576923 by the number of cards (rows) in the deck: .0576923 * 52 = 3. This is exactly the number of cards matching the criteria we actually have.

Let’s translate this example to our cardinality problem. We are looking for rows of data that have two specific characteristics: AddressLine1 and PostalCode equal to specific values. In this situation the probabilities of a match for AddressLine1 and PostalCode are 2020/4040008 and 13714.8/4040008 respectively.

Multiply these together and by the number of rows in the table and the cardinality estimate for the query is: (2020/4040008) * (13714.8/4040008) * 4040008 = 6.85739.
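
The same combination written as a small script, with the EQ_ROWS values from the two histograms hard-coded (a sketch of the arithmetic only):

DECLARE @Rows float = 4040008
DECLARE @AddressLine1Rows float = 2020  --EQ_ROWS for '100, rue des Rosiers'
DECLARE @PostalCodeRows float = 13714.8 --EQ_ROWS for '2010'

SELECT (@AddressLine1Rows / @Rows) * (@PostalCodeRows / @Rows) * @Rows AS CombinedEstimate  --approximately 6.857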

What about three columns:

SELECT * FROM dbo.AWStatTest where AddressLine1 = N'11, quai de l´ Iton' and PostalCode = N'94010' and City = N'Bellingham'

The cardinality estimates for each of these values individually are 9090, 42062.04, and 58405.71 respectively.

(9090/4040008) * (42062.04/4040008) * (58405.71/4040008) * 4040008 = 1.36819

The examples above all used values that were RANGE_HI_KEYS from the respective histograms. In my tests I have seen that estimates using values from the AVG_RANGE_ROWS column work exactly the same way.

Using multiple statistics objects isn’t the only way SQL Server can estimate cardinality on a table with multiple predicates.

Using a multi-column density
In Post 2 I talked about density vectors and how they can be used. They can also be used for multi-column estimates. To force SQL Server to use a multi-column density for cardinality estimation I am going to create a new table with the same data as the old one, turn off auto creation of statistics and create a statistics object for just PostalCode and AddressLine1. The primary reason I’m creating a new table is because I’ve done so much testing on my other one that I fear being unable to get pure estimates.

SELECT * INTO dbo.AWAL1PCTest from dbo.AWStatTest
CREATE STATISTICS PostalCodeAddressLine1 ON dbo.AWAL1PCTest (PostalCode,AddressLine1)
ALTER DATABASE TestDB SET AUTO_CREATE_STATISTICS OFF

I should end up with a statistics object for PostalCode and AddressLine1 that looks like this:

The density value for the combination of AddressLine1 and PostalCode is .000249066
The estimate should be .000249066 * 4040008 = 1006.23

Running this query:

SELECT * FROM dbo.AWAL1PCTest where AddressLine1 = N'100, rue des Rosiers' and PostalCode = N'2010'

The cardinality estimate comes out to 1006.23, matching the density vector calculation.

In this post we have seen that cardinality estimates for predicates that contain multiple matching conditions for two different columns on the same table can be estimated in at least two ways.
1. By using the histogram values from multiple statistics objects
2. By using a multi-column density value

The point of the previous three posts is not to outline all of the ways the density vector and histogram parts of statistics objects are used to make cardinality estimates. That information is not publicly available (that I can find anyway). Most of what I talk about in the previous three posts comes from performing a lot of tests and comparing results to calculations. The purpose is to highlight how the various components can be involved so that one can have more success in troubleshooting cardinality estimate issues. I’ve often been surprised at some of the estimates that come up and they don’t always align themselves to the “rules” I’ve laid out above, though they usually do. That being said, tracing cardinality estimates back to the specific statistics used has been helpful in identifying skewed data distributions and possible places for optimizations.

The next topic is an area that is very opaque and, I suspect, is one of the causes of some of the odd cardinality estimates I’ve seen: the String Index.

Statistics (Part 3) – Histograms

As noted in the previous entry, density can be used to predict the number of records that will be returned from a query against a table. This is fine if there is an even distribution of data, but what about the cases where there is not?

As demonstrated in the previous post, the density for the column AddressLine1 of 0.0003552398 multiplied by the number of rows in the table, 4040008, gives us 1435.17, which is the number of rows the query optimizer estimates will be returned from a query against the table with a single predicate against the column AddressLine1. Given an even distribution of data each unique value in the column would exist in the set roughly 1435 times. SQL Server can use the density vector to estimate the number of rows that will be returned from a query. The caveat here is ‘given an even distribution’. What if the distribution is uneven?

What if you have 10000 rows with 9999 values in a column equal to A and 1 equal to B? This was the example I set up in my first post in this series. In this case the density would be 1/(number of unique values), or ½, or .5. If we use the technique of multiplying the density by the number of rows in the table we would get an estimate of 5000 rows for each value. This could cause the optimizer to produce a plan that scans the entire table as opposed to a plan that queries an index and does a bookmark lookup. If the query is looking for records where the value of the column is A the plan produced is OK since it is likely a table scan would be used anyway. But looking for values of B would require a table scan to find a single record. On the flip side, using bookmark lookups to retrieve 9999 rows when a table scan would require less I/O is not a good idea either. I refer you to my first post in the series where I showed the number of physical reads required for a table scan relative to a bookmark lookup.

SQL Server can use histograms to make more refined predictions.

In my previous post I created a test table with 4040008 rows and an index on the column AddressLine1. The index automatically creates a statistics object. You can see the histogram of the statistics attached to the demo index by using DBCC SHOW_STATISTICS. I use the WITH HISTOGRAM option so that only the histogram is returned.

DBCC SHOW_STATISTICS('AWStatTest','IX_AWStatTest')
WITH HISTOGRAM

One thing to note about the histogram is that the values of RANGE_HI_KEY are sorted in ascending order. All values not equal to a RANGE_HI_KEY naturally will be of greater or lesser value than the RANGE_HI_KEYs and, subsequently, can be thought of as existing between two keys that are side by side. If our range hi keys are 5, 10, 20, and 25 the value 7 exists between the entries of 5 and 10. While it also exists between the entries of 5 and 20, the next highest entry to 7 is 10. For query optimization the data in the row of the next highest entry is used for cardinality estimates. In our example the value of ’081, boulevard du Montparnasse’ exists between ‘#500-75 O’Connor Street’ and ’1, place Beaubernard’ when these values are sorted in ascending order.

With this we know then there are two types of estimates that can be made from the histogram. Values equal to the RANGE_HI_KEY and values that are between two RANGE_HI_KEYs which are side by side.

The cardinality estimates for predicates equal to the RANGE_HI_KEY come from the EQ_ROWS column. If we query the table for the value, “1, place Beaubernard” in the column AddressLine1, the optimizer will predict 4040 rows to be returned.

SELECT * FROM dbo.AWStatTest where AddressLine1 = '1, place Beaubernard'

With a full sample the EQ_ROWS value for each RANGE_HI_KEY shows the exact number of times that the RANGE_HI_KEY value appears in the index/heap’s column. The values in the RANGE_ROWS and DISTINCT_RANGE_ROWS columns are used to calculate AVG_RANGE_ROWS which, in turn, is used to provide cardinality estimates for values that don’t match the RANGE_HI_KEYs.

RANGE_ROWS is the number of records which contain a value that sorts between two side-by-side RANGE_HI_KEYs, the current key and the next lowest key. In our example RANGE_ROWS for ’1, place Beaubernard’ is 14147. This means there are 14147 records which contain a value between ’1, place Beaubernard’ and ‘#500-75 O’Connor Street’.

We can use this query to test this.

SELECT COUNT(*) FROM dbo.AWStatTest
WHERE AddressLine1 > '#500-75 O''Connor Street'
AND AddressLine1 < '1, place Beaubernard'

DISTINCT_RANGE_ROWS are the number of distinct values between two range hi keys.

SELECT COUNT(*) [count]
FROM
(
SELECT DISTINCT AddressLine1
FROM dbo.AWStatTest
WHERE AddressLine1 < '1, place Beaubernard'
AND AddressLine1 > '#500-75 O''Connor Street'
) AA

If you recall from the previous lesson, density is calculated as 1/(number of unique values). So we can calculate the density of the range between two RANGE_HI_KEYS. The density of the range between ’1, place Beaubernard’ and ‘#500-75 O’Connor Street’ is 1/13 or .0769231.

Density times the number of rows gives us the average number of times each value will exist in this range: .0769231 X 14147 = 1088.231, which is the value of the AVG_RANGE_ROWS column for the RANGE_HI_KEY ’1, place Beaubernard’. If we query AddressLine1 for a value that exists in this range the number of rows estimated to be returned should be 1088.231.
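
Expressed as a small script with the histogram values from this step hard-coded (again, only a sketch of the calculation):

DECLARE @RangeRows float = 14147       --RANGE_ROWS for the '1, place Beaubernard' step
DECLARE @DistinctRangeRows float = 13  --DISTINCT_RANGE_ROWS for the same step

SELECT @RangeRows / @DistinctRangeRows AS AvgRangeRows  --approximately 1088.23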

SELECT * FROM dbo.AWStatTest where AddressLine1 = '081, boulevard du Montparnasse'

This density estimation between the RANGE_HI_KEYs works just like the density estimation for the entire column but with more precision. It can be more precise because the range is smaller and because as the histogram is built SQL Server can choose range hi keys so that the steps between them have a more even distribution.

These are examples of single column statistics being used with queries against individual columns. What happens if the query against the table contains two predicates? That is the focus of the next post.