The number one thing to understand about using SQL to get data from a database is how it operates against data and what it outputs when you are done. To explain this I must first explain what sets, rows and columns are. Once you know this you can start doing operations on those sets, rows and columns. You should read my first post if you have not already. It explains the format I am taking, why I’m writing this, and some of the liberties I am taking in my terminology.
What is a set?
I find it is easier to understand sets if you have a picture to look at because it can give us an instinctual understanding of rows and columns.
This is a graphical representation of a set. Anyone who has worked with spreadsheets has seen this kind of graphical representation before.
A set is a collection of entities. So then what is an entity? An entity is a representation of a “real” thing or item. I put real in quotes because real doesn’t have to be a physical object or an actual object that really exists though it often is. In this set the people are conjured up to be examples. They aren’t real people though they could be. They represent the idea of a real person but not an actual person. Another example is a sale. You may record all of your sales in a database but a sale isn’t a tangible thing. Your company can sell a tangible thing, take tangible money for that thing and transfer that thing to a tangible person. You can even record the sale on a tangible receipt. However the sale itself isn’t tangible even though it did take place and can be recorded in a database.
Each entity should have some piece of information that uniquely identifies it and makes it distinct from the other entities. In this limited example first and last name handle this duty. While in a real database you wouldn’t want to use just first and last name for this (because it is pretty easy to find two people with the same first and last name), for understanding the concept and in this limited example using first and last name is fine.
Each entity can also have attributes that describe it. In our example each person has an email address and a phone number. So in short, an entity is a representation of a “real” thing that can be distinctly identified by some piece of information and has, associated to it, attributes which describe the entity.
I will note here that some people use the term entity to refer to a collection of things and use the term instance when talking about an individual thing in that collection. I think that is fine but it is often counterintuitive to someone just starting out. As this point we don’t want to get bogged down in a war about terminology. The point is to understand the concept enough to understand how SQL works. Just know that when I used the term entity I am talking about a single thing not a collection. I use the term set for a collection of entities.
A set is a collection of zero or more entities and this collection is organized into rows.
Every row in this example set is about a single entity. In this case the thing that the row is about is a person. Each row identifies a specific person and has information about that person. The first row is a woman named Cassidy Griffin, whose email address is email@example.com and whose phone number is 999-555-0198. Beside her name there is the number 6046. This is a unique semi-arbitrary number assigned specifically to her. You can ignore this for now. I will explain what that is for later.
The forth row is Marcus A. Powell and the email address and phone number in this row belongs to Marcus. In general you can consider each row as a statement of fact. There is a person named Cassidy Griffin and this row is info about her. And there is a person named Marcus A. Powell and this row is info about him.
One of the important things to recognize is that each row represents a single entity. Information about Cassidy doesn’t bleed over into the row that holds information about Marcus and vice-versa. You can separate the row about Cassidy from the row about Marcus and you don’t lose any information about either. This is important when you start talking about operations against sets.
A column is a cross section of a set for a specific attribute. Looking at the email address column from the example above we can find all of the email addresses for every entity (or person) in our set. Recall that I said that each row is a statement of fact about something in the world. If that is true then the columns of the set determine what information makes up that statement of fact. As you add columns to a set you are adding more statements of fact about each entity in the set. I could add an Address column to the set above and we could then record as a statement of fact, the address of each person.
The columns determine what type of information is going to be stored for each row and each column is type specific. A phone number isn’t stored in a FirstName column, nor is a FirstName stored in an EmailAddress column. Phone number columns store phone numbers only. For each row and each column there is only one value stored. In other words, if Cassidy has two email addresses you wouldn’t put both of them into the EmailAddress column. If she has two phone numbers you don’t put two phone numbers in the Phone column. Each column for each row should hold a single value. A caveat for this is that the value stored can be a value of a complex type, such as an XML string. An XML string has a lot of different values but as a whole the XML string can be considered a single value.
The Shape of a Set
So, sets are collections of rows that all have the same columns and the columns all hold the same type of information. I’m going to call this the shape of the set. In our example above the set has the columns ContactID, FirstName, MiddleName, LastName, EmailAddress and Phone. If you have two sets with the same columns of the same types then these sets have the same shape. If you add a column to a set then you have changed the shape of the set.
A set can have zero rows or a set can have an infinite number of rows. The set in the picture above has 10 rows. This set has 0 rows.
The number of rows that the set has doesn’t affect its shape.
What can you do with sets?
With sets you can do some very interesting things.
We can add sets together.
added to this set:
results in this set:
We can subtract one set from another.
minus this set:
results in this set:
Remember when I said that the information about one person is contained in one row and that information doesn’t bleed into another row? The containment of that information in one row allows us to subtract sets like this. I can subtract the rows of one set from another and not affect the remaining rows.
You can also look for where sets intersect.
The intersection of this set:
and this set:
result in this set:
The result is the two rows the sets have in common.
In all of the previous examples it is important to understand that when adding, subtracting or showing the intersection of two sets the comparison is based on all columns for a single row. Let’s look at the intersection example again. It will look like the previous intersection example but in the second set David Williams has a middle name of Ray.
The intersection of this set:
and this set:
results in this set:
Why does this intersection result in a single row only when the previous example resulted in two rows remaining when the two sets being checked for intersection look the same. Because when adding, subtracting or intersecting sets all of the columns for each row are used to determine if a row in one set is the same as a row in another set. In this example only Riley Brooks is common between the two sets being compared for where they have the same rows. The David Williams row in each set is no longer the same because the middle name is different.
You can join sets.
joined to this set on ContactID:
gives you this set:
In this example I took each row from the first set, compared it with each row from the second set on a specific column and if the values matched the combination of matching rows became part of a new set. For example ContactID of 6046 in the first set is the same as ContactID 6046 in the second set. Since these match I will put these rows together end to end as on big row. I do this for every row in each set until I have all the rows that match.
This is where an understanding the shape of set comes into play. When doing addition, subtraction and intersection all of the sets involved have to have the same shape, i.e. the same columns that hold the same data types. Addition, subtraction and intersection operations produce a new set but that set has the same shape as the older sets. However, with joins I don’t require any of the sets to have the same shape. The result of a join is often a set with a completely different shape than the old sets. The shape of the set in this join example has been simplified to contain all of the columns of each set however the shape of a reulting set can be dictated.
Next post I will talk about T-SQL commands that help us dictate the shape and content of a result set, SELECT and WHERE.