The danger of NOT IN (Or, you keep using that word. I do not think it means what you think it means.)

I hinted at this in my post on anti-joins but I thought it deserved a little more attention, especially since this SQL construct is one I've seen used many times and, in the past, used quite liberally myself. The danger is that when doing an anti-join with NOT IN there is a risk that SQL Server will give you unexpected results.

First it helps to understand what an anti-join is. Basically you are telling SQL Server to give you all of the rows from one result set that don't have a match in another result set. My previous post has more details.

However I’m not going to start with an anti-join example but with an example of IN and NOT IN because this makes the anti-join example easier to understand.

First I need to create some data to work with.

CREATE TABLE dbo.Person
(
Name varchar(20),
AddressID int
)

CREATE TABLE dbo.Address
(
AddressID int,
Address varchar(20)
)

INSERT INTO dbo.Person (Name,AddressID) VALUES ('Wesley',1)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Buttercup',1)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Inigo',2)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Fezzik',2)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Rugen',3)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Humperdinck',3)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Vizzini',4)

INSERT INTO dbo.Address (AddressID,Address) VALUES (1,'Farm')
INSERT INTO dbo.Address (AddressID,Address) VALUES (2,'Thieves'' Forest')
INSERT INTO dbo.Address (AddressID,Address) VALUES (3,'Castle')
INSERT INTO dbo.Address (AddressID,Address) VALUES (null,'Dungeon')

The two tables look like this:
Person:

Address:

The Person table has 7 rows.  AddressID contains the values 1,2,3,4.
The Address table has 4 rows. AddressID contains the values 1,2,3, and for one of the rows this value is null.
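
As a quick preview of where this is headed, here is a sketch of the kind of NOT IN anti-join the rest of the post examines against this data:

-- anti-join: people whose AddressID has no match in Address
SELECT Name
FROM dbo.Person
WHERE AddressID NOT IN (SELECT AddressID FROM dbo.Address)

Vizzini's AddressID of 4 has no match in Address, so you might expect one row back. But because Address.AddressID contains a NULL, the NOT IN predicate can never evaluate to true and the query returns no rows at all.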

Parsing the SET options in sys.dm_exec_plan_attributes

When doing my previous post about plan caching and plan reuse I spent some time looking for how to decipher the set_options attribute of sys.dm_exec_plan_attributes. I couldn’t find anything. Maybe I didn’t look hard enough.

Thankfully, Benjamin Nevarez recently authored a post on the effects of SET options on plan reuse and published the very information I had been looking for.

declare @set_options int = 251
if ((1 & @set_options) = 1) print 'ANSI_PADDING'
if ((4 & @set_options) = 4) print 'FORCEPLAN'
if ((8 & @set_options) = 8) print 'CONCAT_NULL_YIELDS_NULL'
if ((16 & @set_options) = 16) print 'ANSI_WARNINGS'
if ((32 & @set_options) = 32) print 'ANSI_NULLS'
if ((64 & @set_options) = 64) print 'QUOTED_IDENTIFIER'
if ((128 & @set_options) = 128) print 'ANSI_NULL_DFLT_ON'
if ((256 & @set_options) = 256) print 'ANSI_NULL_DFLT_OFF'
if ((512 & @set_options) = 512) print 'NoBrowseTable'
if ((4096 & @set_options) = 4096) print 'ARITH_ABORT'
if ((8192 & @set_options) = 8192) print 'NUMERIC_ROUNDABORT'
if ((16384 & @set_options) = 16384) print 'DATEFIRST'
if ((32768 & @set_options) = 32768) print 'DATEFORMAT'
if ((65536 & @set_options) = 65536) print 'LanguageID'
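
For context, the set_options value itself comes from the plan attributes DMV; something like this (a sketch) pulls it for each cached plan so you have a number to feed into the script above:

-- grab the set_options attribute for each cached plan
select cp.plan_handle, pa.value as set_options
from sys.dm_exec_cached_plans cp
cross apply sys.dm_exec_plan_attributes(cp.plan_handle) pa
where pa.attribute = 'set_options'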

I suggest you go read the whole thing.

Thank you, Benjamin.

Semi-joins and Anti-joins

I originally started this post for T-SQL Tuesday but wasn't able to complete it in time. Being late, I can see what others wrote about, and it appears that semi-joins and anti-joins were a popular topic. A while back I did a post on joins but didn't talk about these. Since I'm teaching a class about joins I am finishing it up anyway so I can refer people to it.

A couple of join types that are commonly used but rarely explicitly labeled are semi-joins and anti-joins.  In a regular join you can get multiple results back for a single row because a row in Table A may match more than one row in Table B.  However with a semi-join you are only getting one row back from Table A regardless of how many matches there are for that row in Table B.

Just like there are right and left outer joins there are right and left semi-joins.
I have two tables, People and Employers.  People has 5 records.

Employer has 10000+ records.  The top 5:

When I originally wrote this post my Employer table only had 5 records as well, but I couldn't produce all the examples I wanted with that so I added 10,000 records or so to the table. I did this by cross joining the AdventureWorks table Purchasing.Vendor with itself.
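
As a rough sketch of the idea (the Name and EmployerID columns here are assumptions about the schema, not necessarily what these tables actually contain), a left semi-join is typically written with EXISTS:

-- left semi-join: each person appears at most once, no matter how many Employer rows match
SELECT p.Name
FROM dbo.People p
WHERE EXISTS (SELECT 1 FROM dbo.Employer e WHERE e.EmployerID = p.EmployerID)

Changing EXISTS to NOT EXISTS flips the semi-join into an anti-join: people with no matching employer at all.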

The Value of Expertise (or Are You Sure About That?)

One of the things I love to do is sail.  The first time I was on a sailboat, other than a Sunfish, was a 30ft yacht pounding to weather into 25 knot winds and heeled about 20 degrees.  It hooked me. Having dreamed about owning a boat since then I heard many cautions from my friends and family.  “You know what boat stands for…Break Out Another Thousand.”  Or, “A boat is a hole in the water you throw money into.”  Not one to be deterred by sanity I decided it is time to fulfill that dream and have started the hunt for the right sailboat for me.

I’ve been reading about boats for 10 years now and thought I would be able to make a wise decision. What I discovered is that despite all those years of study I still don’t know enough.  I’ve heard most of the conventional wisdoms.  Full keel boats are safer.  Ketches are slow.  Flat bottom boats will pound you to death. Sailboats with long overhangs are tender.  There is probably a kernel of truth to most of these statements but I know from studying that they aren’t always true.  The problem is that I don’t know when they aren’t always true.

The realization was that I can speak the words of boats but I don't know the language. Tillers, shrouds, booms, boom vang, wind vane, gooseneck, bow pulpit, spars, topsides, sheer, stanchions, cutter rigs, ketch rigs, Bermuda rigs, yawls, weather helm, sheets, head up, head down, soft chine, hard chine, full keel, skeg, prop, spade rudder, tack, gybe, lift, slip, windward, leeward…it goes on and on. I understand the parts but I don't understand the whole. I understand that if you put a long fin keel on a sailboat it will generally point better than a boat with a full keel. But I don't know what compromises are made when you do that. What do I lose? Safety? Stability? I have an idea but I don't know for certain.

Because of this I have decided to seek the advice of an expert. I contacted a renowned marine architect of small pleasure sailboats to help me navigate this maze. He has literally (and I don't literally mean figuratively, I literally mean literally) designed hundreds of boats. He's had to decide what compromises to make. When designing a boat he has to know the potential downsides of every design decision and when other parts of the design can exacerbate those downsides or counter them. When buying a used boat you have to make decisions about what features you want. Deciding on one feature necessarily rules out others. He can advise me of what I'm giving up with every decision I make and every feature I choose. That's the kind of guy I need in my corner. I would hate to make a decision based on ignorance that costs me tens of thousands of dollars or years of lost enjoyment because I bought the wrong boat. By working with him I don't guarantee that I'll make the right decision but I increase my odds significantly.

I can't help but think about how this translates to the technology field, specifically, SQL Server. There are thousands of people out there whose knowledge of SQL Server is like my knowledge of sailing. They know the words but they don't know the language. Because of the nature of their jobs they can't spend hours every day studying and working with SQL Server. They can't have developed hundreds of databases and have had to make the compromises that one must make. They are good developers and good project managers. They can talk very intelligently about SQL Server and can make good decisions if they have the information. But they don't. Just like my purchase of a boat they run the risk of making a decision born from ignorance that can cost them tens of thousands (and in some cases, I'm sure, hundreds of thousands) of dollars.

This isn’t some sort of shill to promote myself.  I’m not a consultant. This message has personal impact too. Every day I work with SQL Server I run into something I don’t know. Some things can be solved with a Google search.  Others require me to seek the advice of those who know more than I do.  It is a humbling thing to admit that I don’t know something but the downsides of making a bad decision based on ignorance can be great.  I’ve had to learn how to swallow my ego. There is a long game involved too.  It’s much easier to swallow my ego now than it is when a decision I made seriously impacts delivery or production or costs my company lots of money. I know your company is different. They don’t care about wasted money.

When I learned how much it cost to hire the marine architect to help me I initially balked.  No one wants to spend money they don’t have to.  On the other hand the potential costs of not hiring him are huge.  In the grand scheme it isn’t that much money and in a month or two I won’t even miss it.  I’ve spent a lot more money on a lot stupider things.

Companies should make these same considerations. Hiring a SQL Server expert to consult on a design doesn't have to cost a fortune. The cost of some expert guidance can be tiny compared to the total development cost of a single project. If you have good people who know the words, the expert can help them with the language. The consultant doesn't have to do the work. The consultant is there because he knows stuff you don't know, and given your time constraints likely can't know. A few days of consulting here and there can help guide the project and avoid the pitfalls that could cost tens of thousands of dollars in the long run.

Anatomy of a hack (SQL Injection woes)

When I first heard about SQL injection attacks and learned how to prevent them I was confused. I didn't understand how introducing a potentially destructive string into a batch could produce different results depending on how it was introduced. Include ';DROP TABLE ToDrop-- as a search condition in a WHERE clause and usually your query returns no data. Include it in the construction of a dynamic SQL string and you've lost not only your favorite spaghetti rum cookie recipe but everything that was in the table that just got dropped.

Eventually I learned. The reason the string fragment is sometimes dangerous and other times benign is that the string gets introduced to the query processor at different times and, depending on how and when it is introduced, it is treated differently.

The compilation process follows a basic pattern of parsing the T-SQL batch, binding your statements to real objects, optimizing the statements to produce a good-enough execution plan and, finally, executing the query.  Understanding the entire compilation pipeline isn’t necessary to understand SQL injection attacks though. The important thing to know is that during the compilation stage the T-SQL gets parsed by the Query Processor. If someone can introduce an errant string into your T-SQL code text before it is parsed they can modify how your T-SQL code is interpreted.  The parsing of the batch is the danger zone of SQL injection attacks.

A batch can contain multiple independent SQL statements. SQL injection attacks take advantage of this by inserting another SQL statement into the batch as it is getting compiled so that it will be executed as if it were supposed to be there. Executing a batch that contains two simple SQL statements can look something like this:

SELECT * FROM Test WHERE Col1 = 1;SELECT * FROM Test WHERE Col2 = 'aa'
GO

This produces an execution plan that looks like this:

What this shows us is that two independent SQL Statements were executed as part of a single batch. It also shows us that a carriage return isn’t necessary to separate the two SQL statements.   In T-SQL the semi-colon functions as a statement separator.

Now, imagine giving an external user access to SSMS and the ability to write the commands directly into a query window that has access to your database.  When we use EXEC and dynamic SQL we risk that exact scenario. The EXEC command creates a batch that is independent of the batch that launches it.  This batch follows the same basic compilation process of every other batch with parsing being the first step.  This query, plus a peek at the plan cache, demonstrates that this is the case.

DECLARE @stmt varchar(100)
SET @stmt = 'SELECT * FROM dbo.Test WHERE Col1 = 1'
SELECT * FROM dbo.Test WHERE Col1 = 1
EXEC(@stmt)

Line 3 is the batch created by the T-SQL shown above.  Line 2 is the independent batch created by the EXEC command in that T-SQL.

Knowing that when parsing a batch SQL Server treats the semi-colon as a statement separator and that the EXEC command will result in a new batch being created is the key to understanding how SQL injection attacks work. If we allow an outside individual to affect the string that gets fed to the EXEC command, that individual can inject syntax that modifies how the batch is parsed and, ultimately, what gets executed. They can do this if we create a dynamic SQL statement that directly incorporates a string which is passed to us from some web form or win32 application. This sample simulates this scenario.

DECLARE @inject varchar(100)
DECLARE @var char(1)
DECLARE @stmt varchar(100)
SET @var = '1'
SET @inject = ''''';DROP TABLE ToDrop--'
SET @stmt = 'SELECT * FROM dbo.Test WHERE Col2 = ' + @inject + ' AND Col1 = ' + @var
EXEC(@stmt)

In this sample I am declaring @inject and @var as variables but these could easily be input parameters of an encapsulating stored procedure.  When @inject and @var get substituted into @stmt the resulting string becomes:

SELECT * FROM dbo.Test WHERE Col2 = '';DROP TABLE ToDrop-- AND Col1 = 1

This creates a T-SQL batch that, when passed into EXEC, is parsed like every other batch submitted to SQL Server. The semi-colon causes what was intended to be a single command to be separated into two separate commands. The -- at the end of the injection causes the remainder of the command to be treated as a comment. This keeps the injected string from causing a syntax error since there could easily be more T-SQL in the string whose syntax is no longer compatible with the newly formed T-SQL. I simulate this by adding another search condition. If this second statement is syntactically correct and binds to objects that really exist in the database, it will execute. The query plan produced by the sample above looks like this:

Notice -- AND Col1 = 1 at the end of Query 2.  If AND Col1 = 1 had been added to the end of DROP TABLE ToDrop there would have been a syntax error.

So how do we protect ourselves from SQL injection attacks?  If we must use dynamic SQL then using sp_executesql is the key. Rewriting the previous SQL statement using sp_executesql looks something like this:

DECLARE @inject nvarchar(100)
DECLARE @var int
DECLARE @stmt nvarchar(100)
DECLARE @paramdef nvarchar(500)
SET @var = 1
SET @inject = N''''';DROP TABLE ToDrop--'
SET @paramdef = N'@param1 varchar(100),@param2 int'
SET @stmt = N'SELECT * FROM dbo.Test WHERE Col2 = @param1  AND Col1 = @param2'
EXEC sp_executesql @stmt,@paramdef,@param1 = @inject,@param2=@var

When the statement in @stmt is parsed during compilation @param1 and @param2 are practically ignored.  SQL Server understands that these are placeholders for values that will be included later and produces a parameterized query.

Notice there is no semi-colon separating two statements in the batch.  There is no drop table statement. And notice the execution plan is safe.

When this query is finally executed the values in @inject and @var are substituted for @param1 and @param2. The previously dangerous string is now introduced as a simple search condition that returns no results because there is no record where Col2 = '';DROP TABLE ToDrop--

In the sp_executesql example there is no direct path from @inject to the batch. It is never included during the parsing of the select statement and thus cannot be injected into the batch. This doesn't mean that sp_executesql by itself is the panacea for all injection woes. It must be used correctly. It is possible to execute the unsafe statement from the first example with sp_executesql as well. To make it safe you must use parameters properly.
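
For example, this misuse (a sketch) concatenates the input straight into the statement text, so it is just as vulnerable as the EXEC version even though it goes through sp_executesql:

DECLARE @inject nvarchar(100)
DECLARE @stmt nvarchar(200)
SET @inject = N''''';DROP TABLE ToDrop--'
-- the input becomes part of the statement text and is parsed as part of the batch
SET @stmt = N'SELECT * FROM dbo.Test WHERE Col2 = ' + @inject
EXEC sp_executesql @stmt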

If you don't want to use parameterized queries but still want to be protected from SQL injection you can use QUOTENAME and REPLACE to sanitize your input. Examples of that are here.

Dude, where’s my plan (part 2).

In part 1 I looked at the general makeup of the plan cache and how SQL Server decided where in the plan cache to save a plan.  Now I will look at how SQL Server finds a plan in the plan cache so it can be reused instead of precious resources being wasted for plan recompilation.  When SQL Server needs to look up a plan it uses the same hashing algorithm it used to decide where to save the plan.  This algorithm relies on the object id, the database ID and the size of the cache in question. This algorithm produces a bucket number and it is in this bucket of the specific cache store where SQL Server looks for a saved plan.  This sounds vague but part 1 has a lot more detail. I’m trying not to rehash that here.

For objects (like views, triggers, etc.) the object id used in the hash is simply the object id of the object in question. These object ids can be seen by querying sys.objects. Ad hoc queries, though, don't have an object id. For these, the batch text is hashed to produce an object id. Once you know the algorithm SQL Server uses to cache a plan, it makes sense why the plan for an ad hoc query cannot be reused by a logically similar query whose text doesn't match exactly. The object id produced by each query will be different, and if that object id is different the bucket used to store that plan will be different.

For example these two select statements, though logically similar and producing practically the same plan, get stored in different buckets.  The difference in these queries is that one of them has an extra space between the equal sign and the number one.
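
The pair looked something like this (using the same dbo.Test table as the later examples; the second query has the extra space before the 1):

SELECT * FROM dbo.Test WHERE Col1 = 1
SELECT * FROM dbo.Test WHERE Col1 =  1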

Once the proper bucket is discovered the plans stored in that bucket have to be examined to see if one matches the query being executed.  The question then becomes, what is SQL Server looking at to determine if a query being executed matches an execution plan that is saved in the plan cache?

When a query is being run there are attributes that can affect the execution of the query.  For example, a session with one specific SET option may return different results than a session with different SET options.  Collectively these attributes make up what is known as the cache key.  You can see which attributes comprise the cache key for a plan by querying sys.dm_exec_plan_attributes and looking at the is_cache_key column.

select st.text,pa.*
from
sys.dm_exec_cached_plans cp
cross apply sys.dm_exec_sql_text(cp.plan_handle) st
cross apply sys.dm_exec_plan_attributes(cp.plan_handle) pa
where st.text = 'SELECT * FROM dbo.Test WHERE Col1 = 1'

If any of the attribute values for the query being executed don't match a plan already stored in the bucket then the cached plan will not be reused, even if the text of the query matches and the results of the query would be the same.

To show this I have two examples.

Example 1: I created a small test database and opened two sessions in SSMS.  In one of the sessions I set the DATEFORMAT to dmy. In the second session I left the DATEFORMAT as the default, which is mdy on my instance.

Then I ran this query in each one:

SELECT * FROM dbo.Test WHERE Col1 = 1

Looking at the plan cache I see two cached plans with the same bucketid. I know they are different plans because they have different plan handles.

To see which cache key values don’t match I compare the cache key attributes side by side.

The values for the date_format attribute are not the same. Even though the queries hash to the same bucket, two different plans end up being stored.

Example 2: In the above plan attribute comparison notice that one of the attributes is user_id.  For this attribute Books Online says

Value of -2 indicates that the batch submitted does not depend on implicit name resolution and can be shared among different users. This is the preferred method. Any other value represents the user ID of the user submitting the query in the database.

If the object names in the query aren't qualified by a schema name, the search for the object will start with the user's default schema. If each user happened to have a table or view named Test in their default schema, that is the object that would be found and the plans generated would actually be querying completely different tables.

For this second test I still use two sessions but I am executing the query as a different user in each session. Each of these users has a different default schema. I am also removing the schema qualification from the query: SELECT * FROM Test WHERE Col1 = 1
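
A sketch of that setup (the user and schema names are just placeholders):

-- two users, each with a different default schema
CREATE SCHEMA SchemaA
GO
CREATE SCHEMA SchemaB
GO
CREATE USER UserA WITHOUT LOGIN WITH DEFAULT_SCHEMA = SchemaA
CREATE USER UserB WITHOUT LOGIN WITH DEFAULT_SCHEMA = SchemaB
GRANT SELECT ON dbo.Test TO UserA
GRANT SELECT ON dbo.Test TO UserB
GO
-- in each session, run the unqualified query as a different user
EXECUTE AS USER = 'UserA'
SELECT * FROM Test WHERE Col1 = 1
REVERT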

A comparison of the plan attributes for the different plans produced shows that now the user_id attribute is different.

In this particular database “Test” resolves to the Test table in the dbo schema.  But that can’t be guaranteed without qualifying the table name with the schema that owns it. This is one of the reasons it is a good idea to qualify objects with the schema the object lives in.

I can envision a poorly written application that uses many ad hoc queries and has database users all defaulting to their own schema. A common query executed by each client application (such as querying some lookup tables) could produce a plan that will be cached in the same bucket as every other client running the same query. Since their cache keys won't match, each instance of the application will produce another plan to be put in the same bucket, potentially producing a very long hash chain that must be traversed every time that query is run. If the lookup tables are used often this could result in an application with reduced performance.

Is this scenario likely? No, but something similar could happen with any of the attributes designated as cache keys.

DBCC CHECKPRIMARYFILE

I discovered this undocumented command one day when I was looking for a way to loop through a bunch (as in hundreds) of unattached SQL Server databases and collect information about each one to put into a table. This was the perfect command to do that. It takes two parameters. The first is the path to a SQL Server data file. The second is an ordinal, 0 to 3.
Option 0 returns one column, IsMDF. IsMDF is 1 if the file is the primary data file for the database, 0 if not. Despite the column name, if you run it against an .mdf file that isn't the primary file it will still return 0. I assume the name comes from the suggested SQL Server naming convention that all data files except the primary use a .ndf extension.

Once you know that the data file is the primary data file option 1 will return info about the other files and file groups that are included in that database.  This information can be used to create dynamic SQL to attach the database.  Being able to determine this information dynamically is useful if the databases to be attached vary in the number, type and location of the secondary files.  Running this against a data file that isn’t the primary file will produce an error.  Option 3 produces an abbreviated version similar to Option 1.

Option 2 returns information about the database itself including the database name (which isn't always the same as the file name), the database version (no need to try to attach a database that is a higher version or accidentally upgrade a database by attaching it to a higher version), and the database collation.
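
Usage looks like this (the file path is a placeholder):

-- option 2: name, version and collation of the database the primary file belongs to
DBCC CHECKPRIMARYFILE ('C:\SQLData\SomeDatabase.mdf', 2)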

The usual caveats apply when talking about undocumented commands. They generally shouldn't be used in production.

Dude, where’s my plan (part 1).

When a plan is cached it is stored in one of four cache stores.  Query  sys.dm_os_memory_cache_counters to see the cache stores in SQL Server.  There are four cache stores that handle plans and plan related objects.  They are Object Plans, SQL Plans, Bound Trees, and Extended Stored Procedures.

SELECT *
FROM sys.dm_os_memory_cache_counters
WHERE type in
    ('CACHESTORE_OBJCP','CACHESTORE_SQLCP',
     'CACHESTORE_PHDR','CACHESTORE_XPROC')

Each of these cache stores holds different types of plans. In general I'm only going to look at the cache stores for Object Plans and SQL Plans. The Object Plans cache store holds the plans for stored procedures, triggers and functions. The SQL Plans cache store holds the plans for prepared queries, ad-hoc queries and parameterized queries.

Each cache store contains a hash table. I think the wikipedia entry about hash tables is a pretty good intro for those who don't really know much about them (as I didn't before I started studying the plan cache). Each hash table contains multiple buckets and it is in these buckets that the plans are stored. Typically in a hash table a key identifies a single index value (though sometimes multiple keys can point to the same index, producing what is known as a hash collision…more on this later). This index value is the bucket (or slot) that contains the item being stored. The number of buckets in each cache store can be seen by querying sys.dm_os_memory_cache_hash_tables.
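
A quick sketch of that query for the four plan-related cache stores:

SELECT type, buckets_count
FROM sys.dm_os_memory_cache_hash_tables
WHERE type in
    ('CACHESTORE_OBJCP','CACHESTORE_SQLCP',
     'CACHESTORE_PHDR','CACHESTORE_XPROC')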

The question is, how does SQL Server use the plan cache and why do I care?  The reason I care is because understanding how plans are hashed and stored can help me identify when I’m having problems with plans not being reused or if I’m having performance problems because of long hash chains.    Also, it is nifty to see how SQL Server is working behind the scenes.

According to Kalen Delaney the hash key used is determined by this algorithm: (object_id * database_id) mod (hash table size).

What this should mean is that if I take the object id, multiply it by the database id and then mod it by the hash table size I should get the bucket that the plan is going to be stored in.  For my first test this worked great but I noticed that this didn’t work consistently for all my plans.  So after a little research I learned that the formula above isn’t exactly correct.  Well, I’d better say, it doesn’t exactly work like I thought it would.

I was still thinking in terms of unlimited storage space for a number.  Unlike people who’ve been programming in C for years I wasn’t thinking that the result of a 32-bit unsigned integer multiplied by another 32-bit unsigned integer had to fit in the space allowed to a 32-bit unsigned integer.  The chance of overflow is great.  My first example that appeared to work didn’t overflow the space allowed so I didn’t see any problems with it.  As I began to look at other examples the results weren’t correct and I had to figure out why.

One of the ways of solving the overflow problem is to wrap back around when you run out of space.  A 32-bit unsigned integer can hold a range of values from 0 to 4294967295, or 4294967296 values total.  If you are at 4294967295 and add 1 to that the result is zero.  Since multiplication is simply adding the same number over and over again it should be relatively easy to figure out what the result would be after wrapping around again and again.

Rather than figure that out I looked for a shortcut. If the product of the two numbers I multiply together is greater than 4294967295, I can mod that product by 4294967296 to get what the final number would have been had I manually calculated what it would eventually wrap around to.
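
For example (made-up numbers): if the object id were 3,000,000,000 and the database id were 3, the true product is 9,000,000,000, which doesn't fit in 32 bits. 9,000,000,000 mod 4,294,967,296 = 410,065,408, which is exactly the value the multiplication would have wrapped around to.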

So now the trick is to test the theory out on the compiled plans in the plan cache.

For objects like stored procedures and triggers this is relatively easy because the object id and the database id are presented by the sys.dm_exec_query_plan DMV. For ad hoc plans it is a bit harder. The object id is created by hashing the batch text and it isn't shown in sys.dm_exec_query_plan. It is however shown in sys.dm_exec_plan_attributes. The problem with sys.dm_exec_plan_attributes is how it presents each attribute. Each attribute is a row and not a column in the DMV. This means to get them into columns I have to pivot the result.

;WITH PlanAttributePivot (plan_handle, objectid, dbid)
AS
(
    SELECT plan_handle, pvt.objectid, pvt.dbid
    FROM
    (
        SELECT plan_handle, epa.attribute, epa.value
        FROM sys.dm_exec_cached_plans
        OUTER APPLY sys.dm_exec_plan_attributes(plan_handle) AS epa
        WHERE cacheobjtype = 'Compiled Plan') AS ecpa
        PIVOT (MAX(ecpa.value) FOR ecpa.attribute IN
        ("objectid", "dbid")
    ) AS pvt
)
SELECT *
FROM PlanAttributePivot PA
INNER JOIN sys.dm_exec_cached_plans cp
on cp.plan_handle = PA.plan_handle

Now that I have the object ids and database ids I need to put them through the substitute hashing function I created: (((objectid * dbid) MOD 4294967296) MOD buckets_count). Since the plans I'm interested in are stored in different cache stores there can be a different bucket count for each store. In my query I have to delineate by the store the plan will be stored in and use the bucket count for that store. This query worked for all the plans in my plan cache with the exception of CLR Compiled Funcs, which appear to be stored in a different store altogether.

DECLARE @obj_buckets_count int
DECLARE @sql_buckets_count int
DECLARE @boundtree_buckets_count int
DECLARE @xproc_buckets_count int
SELECT @obj_buckets_count = buckets_count
FROM sys.dm_os_memory_cache_hash_tables
WHERE type = 'CACHESTORE_OBJCP'
SELECT @sql_buckets_count = buckets_count
FROM sys.dm_os_memory_cache_hash_tables
WHERE type = 'CACHESTORE_SQLCP'
SELECT @boundtree_buckets_count = buckets_count
FROM sys.dm_os_memory_cache_hash_tables
WHERE type = 'CACHESTORE_PHDR'
SELECT @xproc_buckets_count = buckets_count
FROM sys.dm_os_memory_cache_hash_tables
WHERE type = 'CACHESTORE_XPROC'
;WITH PlanAttributePivot (plan_handle, objectid, dbid)
AS
(
    SELECT plan_handle, pvt.objectid, pvt.dbid
    FROM
    (
        SELECT plan_handle, epa.attribute, epa.value
        FROM sys.dm_exec_cached_plans
        OUTER APPLY sys.dm_exec_plan_attributes(plan_handle) AS epa
    ) AS ecpa
    PIVOT (MAX(ecpa.value) FOR ecpa.attribute IN
    ("objectid", "dbid")) AS pvt
),
CalculatedBuckets (plan_handle,bucketid)
AS
(
    SELECT
    PA.plan_handle
    ,(cast(PA.objectid as bigint) * cast(PA.dbid as bigint))%4294967296%@obj_buckets_count [bucketid]
    FROM PlanAttributePivot PA
    INNER JOIN sys.dm_exec_cached_plans cp
        ON cp.plan_handle = PA.plan_handle
    WHERE cp.objtype in ('Proc','Trigger') and cacheobjtype <> 'Extended Proc'
    UNION ALL
    SELECT
    PA.plan_handle
    ,(cast(PA.objectid as bigint) * cast(PA.dbid as bigint))%4294967296%@sql_buckets_count [bucketid]
    FROM PlanAttributePivot PA
    INNER JOIN sys.dm_exec_cached_plans cp
        ON cp.plan_handle = PA.plan_handle
    WHERE cp.objtype in ('Adhoc','Prepared')
    UNION ALL
    SELECT
    PA.plan_handle
    ,(cast(PA.objectid as bigint) * cast(PA.dbid as bigint))%4294967296%@boundtree_buckets_count [bucketid]
    FROM PlanAttributePivot PA
    INNER JOIN sys.dm_exec_cached_plans cp
        ON cp.plan_handle = PA.plan_handle
    WHERE cp.cacheobjtype in ('Parse Tree')
    UNION ALL
    SELECT
    PA.plan_handle
    ,(cast(PA.objectid as bigint) * cast(PA.dbid as bigint))%4294967296%@xproc_buckets_count [bucketid]
    FROM PlanAttributePivot PA
    INNER JOIN sys.dm_exec_cached_plans cp
        ON cp.plan_handle = PA.plan_handle
    WHERE cp.cacheobjtype in ('Extended Proc')
)
SELECT CB.bucketid,cp.bucketid,*
FROM CalculatedBuckets CB
INNER JOIN sys.dm_exec_cached_plans cp
    ON cp.plan_handle = CB.plan_handle
--WHERE cp.bucketid <> CB.bucketid

With this I can easily compare the bucket ids SQL Server produced with the pseudo hash I created. Uncommenting the WHERE clause at the end shows where they don't match. I'm pleased with the results. The database I tested this on had over 14,000 plans in the cache store and the only ones that didn't match were the CLR compiled functions. A little research showed that these probably live in a different store so I didn't worry about it.

One of the problems with a hash table is the chance of hash collisions.  This happens when hashing different keys produces the same hash value or bucket.  This leads to multiple plans being stored in the same bucket.  When SQL Server is looking for a plan to reuse it uses the same hashing algorithm to determine the bucket to look in.  It then must search each plan in the bucket to see if it matches certain criteria (more on this in the next post).  If the number of plans in the bucket is excessive SQL Server will have to search through a large number of plans to find one that matches (and to find if there isn’t a matching plan).  Looking at the columns in sys.dm_os_memory_cache_hash_tables will show some general information about how the buckets have been used.  The column buckets_max_length tells you how long the longest hash chain is.

In my next post I plan on looking more at how plans are reused and talk about how SQL Server finds the right plan after it has found the right bucket.

 

Stored Procedures in the procedure cache

I've written about auto-parameterization, forced parameterization, the 'optimize for ad-hoc workloads' option and sp_executesql. Before pulling all of these ideas together to discuss query plan reuse there is another very important component that I haven't talked about: stored procedures. Like parameterized queries, whose execution plans get stored in the plan cache and can be reused by similar queries, the execution plans for stored procedures also get stored in the plan cache. From a plan cache use perspective stored procedures provide some benefits over parameterized queries. However, they also differ in some very important and nuanced ways. The point of this post isn't to examine all of the pros and cons of stored procedures but to look at how they, and the individual statements in them, interact with the plan cache.

First I need to create a table with some example data.

IF object_id('dbo.Skew') IS NOT NULL
DROP TABLE dbo.Skew
CREATE TABLE dbo.Skew
(
Col1 int,
Col2 char(2)
)
GO

DECLARE @c int
SET @c = 0
WHILE @c < 1000
BEGIN
INSERT INTO Skew (Col1,Col2) VALUES (@c,'aa')
SET @c = @c + 1
END
GO

INSERT INTO Skew (Col1,Col2) VALUES (1000,'aa')
GO 1000

INSERT INTO Skew (Col1,Col2) VALUES (2000,'aa')
GO 1000

INSERT INTO Skew (Col1,Col2) VALUES (3000,'aa')
GO 1000

INSERT INTO Skew (Col1,Col2) VALUES (4000,'aa')
GO 1000

INSERT INTO Skew (Col1,Col2) VALUES (5000,'aa')
GO 1000

INSERT INTO Skew (Col1,Col2) VALUES (6000,'aa')
GO 1000

INSERT INTO Skew (Col1,Col2) VALUES (7000,'aa')
GO 1000

INSERT INTO Skew (Col1,Col2) VALUES (8000,'aa')
GO 1000

INSERT INTO Skew (Col1,Col2) VALUES (9000,'aa')
GO 1000

CREATE INDEX SkewIndex ON dbo.Skew (Col1)

This data is skewed so different search arguments will produce different query plans. The purpose of this is to show how, like other parameterized queries, a stored procedure can produce a query plan that may not be the best for subsequent queries with different search arguments.
The query I will be using for most of my examples:

CREATE PROC ExampleProc (@Col1 int, @Col2 nchar(2))
AS
SELECT Col2 FROM Skew where Col1 = @Col1
GO

And this can be used to look at the procedure cache:

select t.text, cp.bucketid, cp.usecounts, cp.size_in_bytes, cp.cacheobjtype, cp.objtype, qp.query_plan
from sys.dm_exec_cached_plans cp
outer apply sys.dm_exec_query_plan(cp.plan_handle) qp
outer apply sys.dm_exec_sql_text(cp.plan_handle) t
where (t.dbid < 32767 or t.dbid is null)
order by objtype,cacheobjtype,t.text

First I want to look at what happens when this query is run outside of a stored procedure, i.e. as an ad hoc query.
SELECT Col2 FROM Skew where Col1 = @Col1

Substituting @Col1 with 1 and running the query produces a plan that looks like this:

But running the query substituting @Col1 with 1000 produces an execution plan that looks like this:

If you recall from reading the post about auto and forced parameterization a parameterized query will create a plan that will continue to be used for future queries that only differ by search arguments.  In this instance the query was not auto-parameterized nor is forced parameterization in effect.  Because of this the optimizer is free to produce a plan that is best suited for the specific search arguments used in each execution.  The upside of this is that the execution plan produced is crafted for the best performance.  The downside is that a lot of queries with different search arguments will produce a lot of different plans and all of these plans will be cached.  This can be tested by changing the search argument to any number and examining the buildup of plans in the plan cache.

When this query is included in a stored procedure it operates much like a parameterized query.
The stored procedure is compiled and, because of parameter sniffing, the execution plan produced is optimized for the search arguments used on the initial compilation.  If the procedure is called again with values that would perform faster with a different execution plan SQL Server will not take advantage of a new plan without specific instructions from the author.
While a stored procedure can protect the plan cache from being filled by a lot of plans it can also cause SQL Server to use a sub-standard plan.
If I call the stored procedure with these parameters

EXEC ExampleProc 1,'aa'

The execution plan looks like this.

This plan is similar to the plan created when the query in the stored procedure (this one: SELECT Col2 FROM Skew where Col1 = 1) was run as an ad-hoc query.
If I change the parameter from 1 to 1000

EXEC ExampleProc 1000,'aa'

the stored procedure will keep the plan created during the initial compilation rather than switching to a table scan, which we know from running the ad hoc query earlier would be the quicker way to retrieve the data for this specific search argument. Even though a table scan would be the most efficient way of getting the data back, SQL Server is stuck using the cached execution plan. Normally I would paste a picture of the query plan here as proof of what is happening, but that would be pretty dumb because the query plan in the plan cache is the exact same plan I just posted. However, one can see that the same cached plan is being used by looking at the usecounts column. The number increments by one each time the plan is used. I executed the stored procedure 5 times and the usecounts value is 5. Not a coincidence.

There are a few ways to mitigate the problem of a statement in a procedure using a substandard query plan.

1. One could create the procedure with the WITH RECOMPILE option
CREATE PROC ExampleProc (@Col1 int, @Col2 nchar(2))
WITH RECOMPILE
AS
SELECT Col2 FROM Skew where Col1 = @Col1
GO

WITH RECOMPILE is a misnomer. It doesn’t actually force the query to recompile, technically. When running a procedure that was created with WITH RECOMPILE the procedure is compiled, the procedure is executed and then the plan that was created is discarded.  A better description of WITH RECOMPILE would be “WITH DON’T SAVE IN THE PLAN CACHE” but I guess Microsoft didn’t want us to have to type all of that.

It is impossible to query the plan cache to see what plan is used (because it isn't cached for us to look at) so instead I will enable viewing the execution plan through the "Include Actual Execution Plan" button on the SSMS toolbar.

Clearing the procedure cache using DBCC FREEPROCCACHE and then executing the stored procedure with the first parameter set to 1 produces this plan.

And this is what is in the proc cache:

The SET STATISTICS XML commands are associated with enabling and disabling the showing of the execution plan in SSMS and can be ignored. The more optimal plan was created and the plan wasn't saved in the plan cache.

Changing the first parameter of the stored proc to 1000 the new plan looks like this:

Looking at the procedure cache shows that the plan was not cached. Unlike the previous examples the plan changed as the search argument changed.

2. Use WITH RECOMPILE when executing the stored procedure.
Adding WITH RECOMPILE to the EXEC command will cause that specific execution of the procedure to compile from scratch and the resulting plan to not be cached. If the procedure had been executed previously without the WITH RECOMPILE option and its plan cached, that cached plan will be ignored by future executions of the procedure that use WITH RECOMPILE. Freeing the procedure cache and running the query a few times shows this in the plan cache. First I have to remove WITH RECOMPILE from the procedure itself and recreate it so its plan will get saved in the cache.
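
Something like this (a sketch of the setup):

-- recreate the procedure without WITH RECOMPILE so its plan will be cached
IF OBJECT_ID('dbo.ExampleProc') IS NOT NULL
DROP PROC dbo.ExampleProc
GO
CREATE PROC ExampleProc (@Col1 int, @Col2 nchar(2))
AS
SELECT Col2 FROM Skew where Col1 = @Col1
GO
DBCC FREEPROCCACHE
GO
-- run it five times so usecounts climbs to 5
EXEC ExampleProc 1,'aa'
GO 5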

I ran the query five times with the first parameter equal to 1 to show that the usecount is now 5 and that the plan includes a bookmark lookup.  The cached query plan looks like this:

Now running the procedure again with the WITH RECOMPILE option, the 1st parameter set to 1000 and Include Actual Execution Plan turned on:

EXEC ExampleProc 1000,'aa' WITH RECOMPILE

The plan cache shows that the use count is still 5 and the execution plan produced shows that a table scan is used instead of the bookmark lookup that is in the cached plan.

Executing the procedure again without the WITH RECOMPILE option and with the 1st parameter as 1000 causes the usecount to be incremented by one and the plan used is the cached sub-optimal plan.

3. Use OPTION (RECOMPILE) on an individual query in the stored procedure.

CREATE PROC ExampleProc (@Col1 int, @Col2 nchar(2))
AS
SELECT Col2 FROM Skew where Col1 = @Col1 OPTION (RECOMPILE)
GO

Watching the usecounts of this procedure in the plan cache shows that the procedure itself is being reused but looking at the query plan each time also shows that this specific statement is being recompiled with each use and that the optimal query plan is being created depending on the value being passed into the first parameter.  Adding a second statement in the procedure allows one to see the difference between how different statements are handled by SQL Server.

CREATE PROC ExampleProc (@Col1 int, @Col2 nchar(2))
AS
SELECT Col2 FROM Skew where Col1 = @Col1
SELECT Col2 FROM Skew where Col1 = @Col1 OPTION (RECOMPILE)
GO

This query shows each of the SQL Statements in the batch:

select
case
when sql_handle is not null then
case
when qs.statement_end_offset <> -1 then
substring(t.text,statement_start_offset/2+1,(qs.statement_end_offset/2-qs.statement_start_offset/2)+1)
else substring(t.text,statement_start_offset/2+1,(len(t.text)*2-qs.statement_start_offset)/2+1)
end
else null
end [query_text],
cp.usecounts,qs.execution_count,qs.plan_generation_num,cp.cacheobjtype,cp.objtype,qp.query_plan,qs.*
from sys.dm_exec_cached_plans cp
outer apply sys.dm_exec_query_plan(cp.plan_handle) qp
outer apply sys.dm_exec_sql_text(cp.plan_handle) t
left join sys.dm_exec_query_stats qs
on qs.plan_handle = cp.plan_handle
where (t.dbid < 32767 or t.dbid is null)
order by objtype,cacheobjtype,t.text

Clearing the procedure cache, running this 5 times and looking at the plan cache shows us what is going on:

EXEC ExampleProc 1000,'aa'

The usecounts column is related to the stored procedure itself. The execution_count column is for the SQL statements. In this example the select without the RECOMPILE option has executed the same number of times the procedure has been used. The select with the RECOMPILE option has been executed once but the plan generation number shows that it has been compiled 6 times, once for the initial compilation and 5 times for the actual executions of the procedure.

The statement-level recompile option is great for those times when a small number of statements in a procedure perform best when the plan for that query is optimized each time it is run.

 

Choosing which queries to parameterize (or don’t work so hard, SQL Server)

Auto-parameterization, i.e. simple parameterization, and forced parameterization will parameterize queries for you while you sit back sipping on a beer. We don't really have a choice with auto-parameterization (though we can force individual queries to not be stored in the plan cache if we choose) and forced parameterization is like using a sledgehammer to remove coffee stains from a tea cup. Forced parameterization operates on many ad-hoc queries at once. It might parameterize some queries you would rather have compiled each time they run and it might fail to parameterize some queries you want to keep in the plan cache. The question is how do we get more granular in deciding which queries to parameterize and which queries not to parameterize?

SQL Server gives us two basic options.  We can turn on forced parameterization and choose individual queries to be compiled at each use or we can leave forced parameterization off and choose specific queries to parameterize (and cache) each time they are run.

One of the ways of choosing which queries to parameterize is the system stored procedure sp_executesql. It allows us to selectively parameterize queries and cache the plans for those queries for easy reuse. With sp_executesql, forced parameterization can be left off and individual queries can be written to take advantage of SQL Server's ability to parameterize.

A confusing thing about sp_executesql is how many times the word parameter comes up when describing the syntax. First there are the parameters that are passed into sp_executesql. Then there are the parameters that are part of the sql statement, and the parameter list, which is a single string that contains a list of parameters. And then there are the parameters themselves, which correspond to the parameters in the parameter list, which correspond to the parameters in the sql statement. That last sentence makes me tired. And in some cases the parameters are part of a single string variable while in others they are individual variables. It can get confusing for someone using sp_executesql for the first time.

I use these definitions to help clarify which parameters are which.
Procedure parameters: these are the parameters of sp_executesql. The number of parameters is variable. It includes the SQL statement, and can include the string that contains the list of parameters for the SQL statement plus a parameter for each parameter in the SQL statement.

Statement parameters: the parameters that are part of the SQL statement to be executed. The SQL statement and the included parameters are all part of the same string variable.

List Parameters: This is a list of the parameters of the SQL statement. The items in this list will correspond with the statement parameters. It is a single string of all the parameters, not individually declared variables.

Value Parameters: These are the parameters that hold the values that are eventually passed to the SQL Statement.  They are actually parameters of sp_executesql so they are also procedure parameters.

An example:
First we need to create the SQL statement which will include the statement parameters
SET @stmt = N'SELECT Col1 FROM Test WHERE Col1 = @1 AND Col2 = @2'

Then create the list parameters.  These are actually part of a single string
SET @listparams = N'@1 int,@2 nchar(2)'

Notice that the parameters in the list parameters string match the statement parameters in the SQL statement.
Then some variables that will hold the values that will eventually be passed to the SQL Statement.  These are also the value parameters.
SET @val1 = 1
SET @val2 = 'aa'

Put them all together and this is what you get.
DECLARE @stmt nvarchar(200)
DECLARE @listparams nvarchar(200)
DECLARE @val1 int
DECLARE @val2 nchar(2)
SET @stmt = N'SELECT Col1 FROM Test WHERE Col1 = @1 AND Col2 = @2'

SET @listparams = N'@1 int,@2 nchar(2)'
SET @val1 = 1
SET @val2 = N'aa'

EXEC sp_executesql @stmt,@listparams,@val1,@val2

The variables aren’t technically needed.   It could have also been written like this:
EXEC sp_executesql N'SELECT Col1 FROM Test WHERE Col1 = @1 AND Col2 = @2'
,N'@1 int,@2 nchar(2)',1,N'aa'

When executing a query using sp_executesql the query is parameterized. This can be seen by looking at sys.dm_exec_cached_plans along with sys.dm_exec_sql_text.
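
A sketch of that check:

-- the parameterized statement shows up with an objtype of 'Prepared'
select cp.objtype, cp.usecounts, t.text
from sys.dm_exec_cached_plans cp
cross apply sys.dm_exec_sql_text(cp.plan_handle) t
where cp.objtype = 'Prepared'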

One thing to notice is that unlike when using auto and forced parameterization there is no shell execution plan that contains a pointer to the full plan.  There is only the prepared query.  This is great if you have a procedure cache littered with shell execution plans that all point to a small set of cached plans.  While the shell query plans are typically smaller than the full query plan they still take up memory which may be better used for caching other plans or, since the procedure cache memory is stolen from the buffer pool, better used for caching data and index pages.

Using sp_executesql is a great way to control what gets into the procedure cache. It provides the ability to parameterize queries that won’t be auto-parameterized or to control the auto-parameterization process.

sp_executesql has other benefits as well which I hope to blog about in a future post.