Category Archives: T-SQL

Creating Easy Test Data

So the other day I needed to create some test data that was fairly randomized. I didn’t like any of the examples I found online so I settled on using the RAND function. RAND gives a number between 0 and 1. If I need data that fits in a range I can simply
multiply the number of values in the range with RAND to return a value in the range.

For example if I need some ASCII I can use

SELECT CHAR(65 + FLOOR(RAND() * 25))

For multiple characters I can use this. Yeah, it isn’t elegant but gave me exactly what I needed.

SELECT CHAR(65 + FLOOR(RAND() * 25)) + CHAR(65 + FLOOR(RAND() * 25)) + CHAR(65 + FLOOR(RAND() * 25))

And for BLOBs:

SELECT REPLICATE(CHAR(65 + FLOOR(RAND() * 25)) + CHAR(65 + FLOOR(RAND() * 25)) + CHAR(65 + FLOOR(RAND() * 25)),500)

Combine these together for something like

SELECT CHAR(65 + FLOOR(RAND() * 25)) + CHAR(65 + FLOOR(RAND() * 25)) + CHAR(65 + FLOOR(RAND() * 25)),
REPLICATE(CHAR(65 + FLOOR(RAND() * 25)) + CHAR(65 + FLOOR(RAND() * 25)) + CHAR(65 + FLOOR(RAND() * 25)),500),
DATEADD(SECOND,FLOOR(RAND() * 3024000),'2015-05-31')

Now throw in an INSERT and a GO 1000 and I’ve got some easy data to work with.

CREATE TABLE #TestData (
	ID INT IDENTITY(1,1) NOT NULL,
	SomeData NVARCHAR(100) NOT NULL,
	MaxData NVARCHAR(MAX) NOT NULL,
	DateToSeconds DATETIME NOT NULL
) 
GO

INSERT INTO #TestData (
	SomeData,
	MaxData,
	DateToSeconds
)
SELECT
	CHAR(65 + FLOOR(RAND() * 25)) + CHAR(65 + FLOOR(RAND() * 25)) + CHAR(65 + FLOOR(RAND() * 25)) AS [SomeData],
	REPLICATE(CHAR(65 + FLOOR(RAND() * 25)) + CHAR(65 + FLOOR(RAND() * 25)) + CHAR(65 + FLOOR(RAND() * 25)),500) AS [MaxData],
	DATEADD(SECOND,FLOOR(RAND() * 3024000),'2015-05-31') AS [DateToSeconds]
GO 1000

SELECT * FROM #TestData AS td ORDER BY ID
GO

Maybe not the most robust solution but it only took 10 minutes and gave me exactly what I needed.

Everything You Ever Wanted to Know About SET NOCOUNT

SET NOCOUNT has to be one of the most used T-SQL statements. I can’t think of a single production script or procedure in which I didn’t use this SET statement. And I have to stretch to come up with a case when I would specifically not use it. The only time I don’t use NOCOUNT is when writing off-the-cuff test scripts. I decided to dig into it a bit and see what it really does behind the scenes.

SET NOCOUNT

From BOL:
“Stops the message that shows the count of the number of rows affected by a Transact-SQL statement or stored procedure from being returned as part of the result set.

SET NOCOUNT ON prevents the sending of DONE_IN_PROC messages to the client for each statement in a stored procedure. For stored procedures that contain several statements that do not return much actual data, or for procedures that contain Transact-SQL loops, setting SET NOCOUNT to ON can provide a significant performance boost, because network traffic is greatly reduced.”

I learned that the first statement isn’t as clear as it could be. I think it is technically true but easy to misunderstand. It appears to say that it stops the count from being returned in a result set. This is not true. The result still returns the count for statements that return data to the client. However, this count is marked as not to be used.

The second account is a bit misleading. It does stop the DONE_IN_PROC messages (tokens, actually, which I will explain later) from being sent. The reduction in network traffic is very case specific. Now it is always better to have less network traffic than more but I can think of only a few cases where a significant difference would be noticed.

Examining this requires a bit of understanding of Tabular Data Streams (TDS). I’m not interested in writing a series of columns about TDS at the moment so I’m going to keep this intro as short as possible. One of the great things about TDS is that we don’t have to know a lot about it ’cause it just works.
Continue reading Everything You Ever Wanted to Know About SET NOCOUNT

REPLICATE

While creating some CLR user defined functions I discovered that using data passed into my nvarchar(max) parameter was getting truncated if it was more than 8000 bytes.

For about a day and a half I suspected that the C# code was not set up to properly handle LOBs and that it was getting truncated. I tried every solution I found online and nothing helped.
pullhairout

I was creating the LOBs using the REPLICATE function. My last resort was to make sure I really was passing in more than 8000 bytes which lead me to read the documentation for REPLICATE.

It says:

If string_expression is not of type varchar(max) or nvarchar(max), REPLICATE truncates the return value at 8,000 bytes. To return values greater than 8,000 bytes, string_expression must be explicitly cast to the appropriate large-value data type.

There was my problem.

Test code:

DECLARE @string nvarchar(max)
DECLARE @result nvarchar(max)

SET @string = N'A'

SET @result = REPLICATE(N'A',4001)
SELECT LEN(@result)

SET @result = REPLICATE(CAST(N'A' as nvarchar(max)),4001)
SELECT LEN(@result)

SET @result = REPLICATE(@string,4001)
SELECT LEN(@result)

Results:
ReplicateLOBResults

This is the link to the request for assistance on SQL Server Central if you are interested in the details of the LOB parameters to CLR functions.

Hair Pulling Photo Attribution: By stewartpilbrow under CC BY-SA 2.0

SQL Injection Info

A list of resources to look at to help understand SQL Injection.

I’ve not seen anything in blog land that describes SQL injection the way I have done so in my original post on the topic. There may be something out there but I don’t know of it.

SQL Injection falls under the rubric of security. Program Manager for the SQL Server Engine team, Il-Sung Lee, gives a talk about developing secure applications. He discusses SQL injection and other interesting things.

Understanding SQL injection requires understanding how the query processor parses a batch. Conor Cunningham gave a speech at SQLbits X about how the query processor works. Not specific about SQL injection but he does talk about the parser and the query tree.

Paul White talks specifically about the parse tree and how you can see the initial tree is. Great for inspecting how a query was parsed.

Column Aliases (or Who Is That Masked Crusader?)

Aaron Bertrand talks about the various methods for column aliasing and which he prefers. My favorite is a variation on #4 in his list (I don’t think he has them in order of preference but numbered them to make them easier to discuss) which is surprising because of the options listed #4 is my least favorite. In the past I used to use = whenever I aliased a column in a select list but over the years I’ve slowly moved to another method.

I will repeat his list here because it’s easier to discuss it without having to constantly refer back to the original post.

SELECT 1 AS x;     -- #1
SELECT x = 1;      -- #2
SELECT 'x' = 1;    -- #3
SELECT 1 x;        -- #4
SELECT 1 AS 'x';   -- #5
SELECT 1 'x';      -- #6

I find #4 to be the least readable, especially when the select list isn’t numbers and the items selected aren’t on different lines. I think this example from AdventureWorks demonstrates the problem.

select ContactID CID,Gender MorF, EmployeeID EID
from HumanResources.Employee

At a glance with a long select list it take more time to interpret and understand. The variation I prefer is like this but using brackets to identify the alias.

select ContactID [CID],Gender [MorF], EmployeeID [EID]
from HumanResources.Employee

One of the reasons I like this method is that I don’t have to rename every column to be consistent. The real column names can also be surrounded by brakets. This makes it easier (though not as easy as Aaron’s preference) to look at a select list and see what column names will be used in the result set. For example:

select
 ContactID [CID]
,[NationalIDNumber]
,Gender [MorF]
,EmployeeID [EID]
,[BirthDate]
from HumanResources.Employee

However I’ve found that the most readable way (for me anyway for for reasons I explain at the bottom of this post) to use this method is to always us [ ] when aliasing a column and never use it when I’m not.

This makes the select list look like this

select
 ContactID [CID]
,NationalIDNumber
,Gender [MorF]
,EmployeeID [EID]
,BirthDate
from HumanResources.Employee

When I see this I know that if I see brackets the column is aliased and if not, it’s not. I suppose there could be an occasion where the brackets are necessary for the real column name in the select list which would result in two brackets in one selection. A table like this:

create table [1test] ([1ID] int)

would require a select like this:

select [1ID] from [1test]

If I wanted to alias 1ID using my convention I’d have to do:

select [1ID] [ID]
From [1test]

I can’t think of an occasion though when this has happened though since I never use a column or table name that requires the brackets in any database that I design.

I appreaciate Aaron’s use of = to alias columns. It makes for very readable code. Though I think the manner in which the alisas are vertically lined up focuses the attention on what the aliases are. For some reason my focus is more often on what columns are being selected from the underlying table (or more specifically, how much of the select list is the result of a function or some other calculation). Lining those up vertically on the left hand side helps me quickly recognize that. Yes it’s easy to look at the = example and see which columns returned are not directly from the underlying table but it is easier for me to scan down the items selected and do a quick mental calcuation of the percent of items that will come from calculations verses those which are selected directly from the underlying table. I’m not sure why this is important to me but it is.

Often when I need to find a specific item in the select list to edit or examine I will open the proc in SSMS and search for that alias and identify the right row immediately. So for me, the method I use (a variation on Aaron’s option #4) provides me with an easy way of knowing what the column names of the return list are but balances that with some other things I want to know at first glance. But I might be a little weird. That is the funny thing about preferences. There are usually some underlying reasons for them that we don’t get until we think about it.

All that being said, if I started using = I could probably adjust pretty quickly and after a while it would be second nature. I’m also not so stuck on my method that I would balk if I worked somewhere that had a coding standard requiring = or AS.

It’s always interesting to see what preferences other SQL Server developer use and why.

Getting Started with SQL on SQL Server (Part 2 – Sets, Rows and Columns)

The number one thing to understand about using SQL to get data from a database is how it operates against data and what it outputs when you are done.  To explain this I must first explain what sets, rows and columns are.  Once you know this you can start doing operations on those sets, rows and columns. You should read my first post if you have not already.  It explains the format I am taking, why I’m writing this, and some of the liberties I am taking in my terminology.

What is a set?

I find it is easier to understand sets if you have a picture to look at because it can give us an instinctual understanding of rows and columns.

An example of a set

This is a graphical representation of a set. Anyone who has worked with spreadsheets has seen this kind of graphical representation before.

A set is a collection of entities.  So then what is an entity?  An entity is a representation of a “real” thing or item.  I put real in quotes because real doesn’t have to be a physical object or an actual object that really exists though it often is.  In this set the people are conjured up to be examples.  They aren’t real people though they could be.  They represent the idea of a real person but not an actual person.  Another example is a sale.  You may record all of your sales in a database but a sale isn’t a tangible thing.  Your company can sell a tangible thing, take tangible money for that thing and transfer that thing to a tangible person.  You can even record the sale on a tangible receipt.  However the sale itself isn’t tangible even though it did take place and can be recorded in a database.

Each entity should have some piece of information that uniquely identifies it and makes it distinct from the other entities.  In this limited example first and last name handle this duty.  While in a real database you wouldn’t want to use just first and last name for this (because it is pretty easy to find two people with the same first and last name), for understanding the concept and in this limited example using first and last name is fine.

Each entity can also have attributes that describe it. In our example each person has an email address and a phone number.  So in short, an entity is a representation of a “real” thing that can be distinctly identified by some piece of information and has, associated to it, attributes which describe the entity.

I will note here that some people use the term entity to refer to a collection of things and use the term instance when talking about an individual thing in that collection.  I think that is fine but it is often counterintuitive to someone just starting out.  As this point we don’t want to get bogged down in a war about terminology. The point is to understand the concept enough to understand how SQL works.  Just know that when I used the term entity I am talking about a single thing not a collection.  I use the term set for a collection of entities.

A set is a collection of zero or more entities and this collection is organized into rows.

Rows

Every row in this example set is about a single entity. In this case the thing that the row is about is a person. Each row identifies a specific person and has information about that person.  The first row is a woman named Cassidy Griffin, whose email address is casidy22@adventure-works.com and whose phone number is 999-555-0198.  Beside her name there is the number 6046.  This is a unique semi-arbitrary number assigned specifically to her.  You can ignore this for now.  I will explain what that is for later.
The forth row is Marcus A. Powell and the email address and phone number in this row belongs to Marcus.  In general you can consider each row as a statement of fact.  There is a person named Cassidy Griffin and this row is info about her.  And there is a person named Marcus A. Powell and this row is info about him.

One of the important things to recognize is that each row represents a single entity.  Information about Cassidy doesn’t bleed over into the row that holds information about Marcus and vice-versa.  You can separate the row about Cassidy from the row about Marcus and you don’t lose any information about either.  This is important when you start talking about operations against sets.

Columns

A column is a cross section of a set for a specific attribute.  Looking at the email address column from the example above we can find all of the email addresses for every entity (or person) in our set. Recall that I said that each row is a statement of fact about something in the world.  If that is true then the columns of the set determine what information makes up that statement of fact.  As you add columns to a set you are adding more statements of fact about each entity in the set.   I could add an Address column to the set above and we could then record as a statement of fact, the address of each person.

The columns determine what type of information is going to be stored for each row and each column is type specific.  A phone number isn’t stored in a FirstName column, nor is a FirstName stored in an EmailAddress column.  Phone number columns store phone numbers only.  For each row and each column there is only one value stored.  In other words, if Cassidy has two email addresses you wouldn’t put both of them into the EmailAddress column.  If she has two phone numbers you don’t put two phone numbers in the Phone column.  Each column for each row should hold a single value. A caveat for this is that the value stored can be a value of a complex type, such as an XML string.  An XML string has a lot of different values but as a whole the XML string can be considered a single value.

The Shape of a Set

So, sets are collections of rows that all have the same columns and the columns all hold the same type of information.  I’m going to call this the shape of the set.   In our example above the set has the columns ContactID, FirstName, MiddleName, LastName, EmailAddress and Phone.   If you have two sets with the same columns of the same types then these sets have the same shape.  If you add a column to a set then you have changed the shape of the set.

A set can have zero rows or a set can have an infinite number of rows. The set in the picture above has 10 rows.  This set has 0 rows.

Set with zero rows

The number of rows that the set has doesn’t affect its shape.

What can you do with sets?

With sets you can do some very interesting things.
We can add sets together.
This set:

added to this set:

results in this set:

An example of a set

We can subtract one set from another.
This set:

An example of a set

minus this set:

results in this set:

Remember when I said that the information about one person is contained in one row and that information doesn’t bleed into another row?  The containment of that information in one row allows us to subtract sets like this.   I can subtract the rows of one set from another and not affect the remaining rows.

You can also look for where sets intersect.
The intersection of this set:

and this set:

result in this set:

The result is the two rows the sets have in common.

In all of the previous examples it is important to understand that when adding, subtracting or showing the intersection of two sets the comparison is based on all columns for a single row.  Let’s look at the intersection example again.  It will look like the previous intersection example but in the second set David Williams has a middle name of Ray.

The intersection of this set:

and this set:

results in this set:

Why does this intersection result in a single row only when the previous example resulted in two rows remaining when the two sets being checked for intersection look the same.  Because when adding, subtracting or intersecting sets all of the columns for each row are used to determine if a row in one set is the same as a row in another set.  In this example only Riley Brooks is common between the two sets being compared for where they have the same rows.   The David Williams row in each set is no longer the same because the middle name is different.

You can join sets.

This set:

An example of a set

joined to this set on ContactID:

gives you this set:

In this example I took each row from the first set, compared it with each row from the second set on a specific column and if the values matched the combination of matching rows became part of a new set.  For example ContactID of 6046 in the first set is the same as ContactID 6046 in the second set.  Since these match I will put these rows together end to end as on big row.  I do this for every row in each set until I have all the rows that match.

This is where an understanding the shape of set comes into play.  When doing addition, subtraction and intersection all of the sets involved have to have the same shape, i.e. the same columns that hold the same data types.  Addition, subtraction and intersection operations produce a new set but that set has the same shape as the older sets.  However, with joins I don’t require any of the sets to have the same shape.  The result of a join is often a set with a completely different shape than the old sets. The shape of the set in this join example has been simplified to contain all of the columns of each set however the shape of a reulting set can be dictated.

Next post I will talk about T-SQL commands that help us dictate the shape and content of a result set, SELECT and WHERE.

 

Getting Started with SQL on SQL Server (Part 1)

This is a blog post that I’ve tried to write three times now.  My intent is to provide someone new to SQL (and SQL Server) enough background and theory about SQL and the relational model that makes their transition to using SQL in their job easier than mine was but not so much theory they get overloaded and fall into a state of analysis paralysis.  There are a couple of reasons I have found this difficult to write.

  1. There is a lot of theory behind SQL that has to be glossed over in a guide that is intended to help someone to get their brains around SQL.  What should and should not be covered is, for me, a difficult decision to make.  Too much theory and it can’t be applied immediately.  Not enough theory and lessons are immediately applicable but it harder to know how to expand on them for different but related uses.
  2. SQL and the relational model aren’t the same.  They use different terminology and SQL will allow you to do things that aren’t preferred in the relational model.
  3. To go from practically zero knowledge about SQL to being able to query a database and reliably get the data back that you want requires quite a bit of upfront knowledge.
  4. In the relational database world many of the words used: entity, column, attribute, row, record, set, etc. have a nuanced definition that often depend on who you talk to.  Some are used interchangeably or synonymously by some people while others insist on strict usage of certain words for certain phenomena.  A surprising number of flame wars revolve around something so innocuous.  It is easy to get bogged down by these nuances in an attempt to please all parties.  But in doing that what was intended to be a short series of blog posts becomes a book.

To write this I had to make some decisions about what terminology to use and what theory to reveal.  In general, I have fallen to using the terminology that is used in SQL Server, table instead of relation, row or record instead of tuple, column instead of attribute, etc. In a few places I throw in my own unique vernacular as well.  It is probably things like this that got me banned from speaking at PASS.  I kid, they didn’t actually ban me.  My invitation got “lost” in the mail. C’est la vie.

The book Pro SQL Server 2008 Relational Database Design and Implementation is an excellent resource to get a better grasp on the nuances of relational vocabulary and other relational theory.  CJ Date’s book SQL and Relational Theory: How to Write Accurate SQL Code is another good source of info about relational theory and he often expounds on when relational theory and SQL differ.

However we aren’t designing a database but learning how to query one and those nuanced distinctions, while good to know, aren’t necessary or possible for a blog post sized guide.  In practice I’ve heard renowned members of the SQL Server community use many of these terms interchangeably and have yet to meet someone on the job who cares if I call a row a record or a column an attribute.  If you are a newbie to SQL Server and you run into one of the few pedantic souls who cares, don’t worry about it.  If you work for one of them you can always quit. It can also be fun to intentionally use the “wrong” word just to tweak them. You can almost always tell from the context what someone means even if they don’t use the “correct” word.

Why I am writing this.

There are some things that I wish I had been told when I first got into databases that would have made the job much easier.  For people who have been working with SQL Server for years this information will be elementary and good to read if it is late and you are out of sheep ( uh…for counting, get your mind out of the gutter.)  It’s the kind of thing you will learn by necessity as you, by trial and error, try to extract data from the database using SQL.  But for those who haven’t been using SQL for years I hope this little introduction will make the move into querying a SQL Server database a bit easier.

So this will go into the category of “Things I wish someone had told me.”  It is very basic stuff but most of us don’t start out writing great SQL queries on our first day just like we don’t start reading Shakespeare when we are four.

My first real post in this series will be about sets, rows and columns.

The danger of NOT IN (Or, you keep using that word. I do not think it means what you think it means.)

I hinted about this in my post on anti-joins but I thought it deserved a little more attention, especially since this SQL construct is one I’ve seen used many times and, in the past, I used it quite liberally. The danger is that when doing an anti-join with NOT IN there is a risk that SQL will give you unexpected results.

First it helps to understand what an anti-join is. Basically you are telling SQL server to give you all of the rows from one result set that don’t have a match in another result set. My previous post has more details.

However I’m not going to start with an anti-join example but with an example of IN and NOT IN because this makes the anti-join example easier to understand.

First I need to create some data to work with.

CREATE TABLE dbo.Person
(
Name varchar(20),
AddressID int
)

CREATE TABLE dbo.Address
(
AddressID int,
Address varchar(20)
)

INSERT INTO dbo.Person (Name,AddressID) VALUES ('Wesley',1)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Buttercup',1)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Inigo',2)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Fezzik',2)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Rugen',3)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Humperdinck',3)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Vizzini',4)

INSERT INTO dbo.Address (AddressID,Address) VALUES (1,'Farm')
INSERT INTO dbo.Address (AddressID,Address) VALUES (2,'Theives'' Forest ')
INSERT INTO dbo.Address (AddressID,Address) VALUES (3,'Castle')
INSERT INTO dbo.Address (AddressID,Address) VALUES (null,'Dungeon')

The two tables look like this:
Person:

Address:

The Person table has 7 rows.  AddressID contains the values 1,2,3,4.
The Address table has 4 rows. AddressID contains the values 1,2,3, and for one of the rows this value is null.
Continue reading The danger of NOT IN (Or, you keep using that word. I do not think it means what you think it means.)

Semi-joins and Anti-joins

I orignally started this post for T-SQL Tuesday but wasn’t able to complete it in time. Being late I can see what other wrote about and it appears that semi-joins and anti-joins were a popular topic.  A while back I did a post on joins but didn’t talk about these. Since I’m teaching a class about joins I am finishing it up anyway so I can refer people to it.

A couple of join types that are commonly used but rarely explicitly labeled are semi-joins and anti-joins.  In a regular join you can get multiple results back for a single row because a row in Table A may match more than one row in Table B.  However with a semi-join you are only getting one row back from Table A regardless of how many matches there are for that row in Table B.

Just like there are right and left outer joins there are right and left semi-joins.
I have two tables, People and Employers.  People has 5 records.

Employer has 10000+ records.  The top 5:

When I originally wrote this post my Employer table only had 5 records as well but I couldn’t produce all the examples I wanted with that so I add 10,000 records or so to the table.  I did this by cross joining AdventureWorks table Purchasing.Vendor with itself.
Continue reading Semi-joins and Anti-joins