Monthly Archives: November 2011

Getting Started with SQL on SQL Server (Part 2 – Sets, Rows and Columns)

The number one thing to understand about using SQL to get data from a database is how it operates against data and what it outputs when you are done.  To explain this I must first explain what sets, rows and columns are.  Once you know this you can start doing operations on those sets, rows and columns. You should read my first post if you have not already.  It explains the format I am taking, why I’m writing this, and some of the liberties I am taking in my terminology.

What is a set?

I find it is easier to understand sets if you have a picture to look at because it can give us an instinctual understanding of rows and columns.

An example of a set

This is a graphical representation of a set. Anyone who has worked with spreadsheets has seen this kind of graphical representation before.

A set is a collection of entities.  So then what is an entity?  An entity is a representation of a “real” thing or item.  I put real in quotes because real doesn’t have to be a physical object or an actual object that really exists though it often is.  In this set the people are conjured up to be examples.  They aren’t real people though they could be.  They represent the idea of a real person but not an actual person.  Another example is a sale.  You may record all of your sales in a database but a sale isn’t a tangible thing.  Your company can sell a tangible thing, take tangible money for that thing and transfer that thing to a tangible person.  You can even record the sale on a tangible receipt.  However the sale itself isn’t tangible even though it did take place and can be recorded in a database.

Each entity should have some piece of information that uniquely identifies it and makes it distinct from the other entities.  In this limited example first and last name handle this duty.  While in a real database you wouldn’t want to use just first and last name for this (because it is pretty easy to find two people with the same first and last name), for understanding the concept and in this limited example using first and last name is fine.

Each entity can also have attributes that describe it. In our example each person has an email address and a phone number.  So in short, an entity is a representation of a “real” thing that can be distinctly identified by some piece of information and has, associated to it, attributes which describe the entity.

I will note here that some people use the term entity to refer to a collection of things and use the term instance when talking about an individual thing in that collection.  I think that is fine but it is often counterintuitive to someone just starting out.  As this point we don’t want to get bogged down in a war about terminology. The point is to understand the concept enough to understand how SQL works.  Just know that when I used the term entity I am talking about a single thing not a collection.  I use the term set for a collection of entities.

A set is a collection of zero or more entities and this collection is organized into rows.

Rows

Every row in this example set is about a single entity. In this case the thing that the row is about is a person. Each row identifies a specific person and has information about that person.  The first row is a woman named Cassidy Griffin, whose email address is casidy22@adventure-works.com and whose phone number is 999-555-0198.  Beside her name there is the number 6046.  This is a unique semi-arbitrary number assigned specifically to her.  You can ignore this for now.  I will explain what that is for later.
The forth row is Marcus A. Powell and the email address and phone number in this row belongs to Marcus.  In general you can consider each row as a statement of fact.  There is a person named Cassidy Griffin and this row is info about her.  And there is a person named Marcus A. Powell and this row is info about him.

One of the important things to recognize is that each row represents a single entity.  Information about Cassidy doesn’t bleed over into the row that holds information about Marcus and vice-versa.  You can separate the row about Cassidy from the row about Marcus and you don’t lose any information about either.  This is important when you start talking about operations against sets.

Columns

A column is a cross section of a set for a specific attribute.  Looking at the email address column from the example above we can find all of the email addresses for every entity (or person) in our set. Recall that I said that each row is a statement of fact about something in the world.  If that is true then the columns of the set determine what information makes up that statement of fact.  As you add columns to a set you are adding more statements of fact about each entity in the set.   I could add an Address column to the set above and we could then record as a statement of fact, the address of each person.

The columns determine what type of information is going to be stored for each row and each column is type specific.  A phone number isn’t stored in a FirstName column, nor is a FirstName stored in an EmailAddress column.  Phone number columns store phone numbers only.  For each row and each column there is only one value stored.  In other words, if Cassidy has two email addresses you wouldn’t put both of them into the EmailAddress column.  If she has two phone numbers you don’t put two phone numbers in the Phone column.  Each column for each row should hold a single value. A caveat for this is that the value stored can be a value of a complex type, such as an XML string.  An XML string has a lot of different values but as a whole the XML string can be considered a single value.

The Shape of a Set

So, sets are collections of rows that all have the same columns and the columns all hold the same type of information.  I’m going to call this the shape of the set.   In our example above the set has the columns ContactID, FirstName, MiddleName, LastName, EmailAddress and Phone.   If you have two sets with the same columns of the same types then these sets have the same shape.  If you add a column to a set then you have changed the shape of the set.

A set can have zero rows or a set can have an infinite number of rows. The set in the picture above has 10 rows.  This set has 0 rows.

Set with zero rows

The number of rows that the set has doesn’t affect its shape.

What can you do with sets?

With sets you can do some very interesting things.
We can add sets together.
This set:

added to this set:

results in this set:

An example of a set

We can subtract one set from another.
This set:

An example of a set

minus this set:

results in this set:

Remember when I said that the information about one person is contained in one row and that information doesn’t bleed into another row?  The containment of that information in one row allows us to subtract sets like this.   I can subtract the rows of one set from another and not affect the remaining rows.

You can also look for where sets intersect.
The intersection of this set:

and this set:

result in this set:

The result is the two rows the sets have in common.

In all of the previous examples it is important to understand that when adding, subtracting or showing the intersection of two sets the comparison is based on all columns for a single row.  Let’s look at the intersection example again.  It will look like the previous intersection example but in the second set David Williams has a middle name of Ray.

The intersection of this set:

and this set:

results in this set:

Why does this intersection result in a single row only when the previous example resulted in two rows remaining when the two sets being checked for intersection look the same.  Because when adding, subtracting or intersecting sets all of the columns for each row are used to determine if a row in one set is the same as a row in another set.  In this example only Riley Brooks is common between the two sets being compared for where they have the same rows.   The David Williams row in each set is no longer the same because the middle name is different.

You can join sets.

This set:

An example of a set

joined to this set on ContactID:

gives you this set:

In this example I took each row from the first set, compared it with each row from the second set on a specific column and if the values matched the combination of matching rows became part of a new set.  For example ContactID of 6046 in the first set is the same as ContactID 6046 in the second set.  Since these match I will put these rows together end to end as on big row.  I do this for every row in each set until I have all the rows that match.

This is where an understanding the shape of set comes into play.  When doing addition, subtraction and intersection all of the sets involved have to have the same shape, i.e. the same columns that hold the same data types.  Addition, subtraction and intersection operations produce a new set but that set has the same shape as the older sets.  However, with joins I don’t require any of the sets to have the same shape.  The result of a join is often a set with a completely different shape than the old sets. The shape of the set in this join example has been simplified to contain all of the columns of each set however the shape of a reulting set can be dictated.

Next post I will talk about T-SQL commands that help us dictate the shape and content of a result set, SELECT and WHERE.

 

Getting Started with SQL on SQL Server (Part 1)

This is a blog post that I’ve tried to write three times now.  My intent is to provide someone new to SQL (and SQL Server) enough background and theory about SQL and the relational model that makes their transition to using SQL in their job easier than mine was but not so much theory they get overloaded and fall into a state of analysis paralysis.  There are a couple of reasons I have found this difficult to write.

  1. There is a lot of theory behind SQL that has to be glossed over in a guide that is intended to help someone to get their brains around SQL.  What should and should not be covered is, for me, a difficult decision to make.  Too much theory and it can’t be applied immediately.  Not enough theory and lessons are immediately applicable but it harder to know how to expand on them for different but related uses.
  2. SQL and the relational model aren’t the same.  They use different terminology and SQL will allow you to do things that aren’t preferred in the relational model.
  3. To go from practically zero knowledge about SQL to being able to query a database and reliably get the data back that you want requires quite a bit of upfront knowledge.
  4. In the relational database world many of the words used: entity, column, attribute, row, record, set, etc. have a nuanced definition that often depend on who you talk to.  Some are used interchangeably or synonymously by some people while others insist on strict usage of certain words for certain phenomena.  A surprising number of flame wars revolve around something so innocuous.  It is easy to get bogged down by these nuances in an attempt to please all parties.  But in doing that what was intended to be a short series of blog posts becomes a book.

To write this I had to make some decisions about what terminology to use and what theory to reveal.  In general, I have fallen to using the terminology that is used in SQL Server, table instead of relation, row or record instead of tuple, column instead of attribute, etc. In a few places I throw in my own unique vernacular as well.  It is probably things like this that got me banned from speaking at PASS.  I kid, they didn’t actually ban me.  My invitation got “lost” in the mail. C’est la vie.

The book Pro SQL Server 2008 Relational Database Design and Implementation is an excellent resource to get a better grasp on the nuances of relational vocabulary and other relational theory.  CJ Date’s book SQL and Relational Theory: How to Write Accurate SQL Code is another good source of info about relational theory and he often expounds on when relational theory and SQL differ.

However we aren’t designing a database but learning how to query one and those nuanced distinctions, while good to know, aren’t necessary or possible for a blog post sized guide.  In practice I’ve heard renowned members of the SQL Server community use many of these terms interchangeably and have yet to meet someone on the job who cares if I call a row a record or a column an attribute.  If you are a newbie to SQL Server and you run into one of the few pedantic souls who cares, don’t worry about it.  If you work for one of them you can always quit. It can also be fun to intentionally use the “wrong” word just to tweak them. You can almost always tell from the context what someone means even if they don’t use the “correct” word.

Why I am writing this.

There are some things that I wish I had been told when I first got into databases that would have made the job much easier.  For people who have been working with SQL Server for years this information will be elementary and good to read if it is late and you are out of sheep ( uh…for counting, get your mind out of the gutter.)  It’s the kind of thing you will learn by necessity as you, by trial and error, try to extract data from the database using SQL.  But for those who haven’t been using SQL for years I hope this little introduction will make the move into querying a SQL Server database a bit easier.

So this will go into the category of “Things I wish someone had told me.”  It is very basic stuff but most of us don’t start out writing great SQL queries on our first day just like we don’t start reading Shakespeare when we are four.

My first real post in this series will be about sets, rows and columns.

Performance Based Management

Andy Leonard doesn’t care for Performance Based Management.  I tend to agree.  I find this approach to be passive aggressive.  If management has a problem with an employee they should confront the employee directly.  With PBM, management can slot people into categories and blame the results on the system.  “Sorry, I didn’t want to punish you but someone had to be in the bottom 20%. It’s out of my hands.”

And then there is the collateral damage which is what I think Andy is talking about.  Even if all employees are doing a fine job someone has to be in the bottom 20% and you can be sure it won’t be the head guys. When you build a house someone has to pick up the shovel and dig.  You can’t punish the guy with the shovel because he’s not the engineer.  If everyone who is building the house is doing a good job but someone has to be labeled as the bottom 20% you can bet it won’t be the engineer or the foreman.