Semi-joins and Anti-joins

I orignally started this post for T-SQL Tuesday but wasn’t able to complete it in time. Being late I can see what other wrote about and it appears that semi-joins and anti-joins were a popular topic.  A while back I did a post on joins but didn’t talk about these. Since I’m teaching a class about joins I am finishing it up anyway so I can refer people to it.

A couple of join types that are commonly used but rarely explicitly labeled are semi-joins and anti-joins.  In a regular join you can get multiple results back for a single row because a row in Table A may match more than one row in Table B.  However with a semi-join you are only getting one row back from Table A regardless of how many matches there are for that row in Table B.

Just like there are right and left outer joins there are right and left semi-joins.
I have two tables, People and Employers.  People has 5 records.

Employer has 10000+ records.  The top 5:

When I originally wrote this post my Employer table only had 5 records as well but I couldn’t produce all the examples I wanted with that so I add 10,000 records or so to the table.  I did this by cross joining AdventureWorks table Purchasing.Vendor with itself.

An example of a left semi-join:

SELECT P.Name, P.Job, P.EmployerID
FROM People P
WHERE EXISTS (SELECT 1 FROM Employers E WHERE E.ID = P.EmployerID )

SQL Server even recognizes that you are doing a semi-join.

This could also be written using IN.  SQL Server produces the same query plan.

SELECT P.Name, P.Job, P.EmployerID
FROM People P
WHERE P.EmployerID IN (SELECT E.ID FROM Employers E)

A wise individual would notice that I could have used an inner join and got the same results because the column EmployerID in my People table would restrict it from matching more than one row in the Employer Table.  This is only true if there is only one matching row in the Employer Table.  As it is for this example there is no unique key on the ID field of the Employee table so we could technically match more than one row.  However, if you put appropriate indexes on these two tables the query optimizer does changes this two an inner join.  The plan of the previous query with some indexes.

Now I will remove the indexes so I can get the plans I’m looking for. What if we turned the example around?  Instead of finding all of the People that have a matching row in the Employer table, let’s find all of the Employers that have a matching row in the People table.

SELECT E.ID,E.Name,E.Address
FROM Employers E
WHERE EXISTS (SELECT 1 FROM People P WHERE E.ID = P.EmployerID)

This is a better example of a semi-join because an inner join wouldn’t produce the result set without duplicates. There are two employees that match to Electronic Bike Repair & Supplies on ID 2 but there is only one row in the result set.  Had this been a inner join there would be two rows in the for Electronic Bike Repair and Supplies in the result set.

Notice the query plan.

It is now using a right-semi join instead.  This is why I had to change my sample data and remove the indexes.  I wanted to get an example of a right semi join but the small sample set I was using wouldn’t do it. I needed to trick the query optimizer.   I can use IN here too and get the same plan

SELECT E.ID,E.Name,E.Address
FROM Employers E
WHERE E.ID IN (SELECT P.EmployerID FROM People P)

How about anti-joins?  Anti-joins return the rows of a table that don’t have a matching row in another table.

SELECT P.Name, P.Job, P.EmployerID
FROM People P
WHERE NOT EXISTS (SELECT 1 FROM Employers E WHERE E.ID = P.EmployerID)

Once again SQL Server recognizes what we are doing and identifies it as an anti-Join.

 

Once again SQL Server recognizes what we are doing and identifies it as an anti-join.

How about using NOT IN instead of NOT EXISTS

SELECT P.Name, P.Job, P.EmployerID
FROM People P
WHERE P.EmployerID NOT IN (SELECT E.ID FROM Employers E)

You might think that would work but there is a problem with it.  Since you are looking for records that don’t have a match it is comparing a NULL value in People.EmployerID with Employer.ID.  If you don’t know what happens when you try to do comparisons on NULLS you should look it up.  The warning here is: be careful using NOT IN.

When doing anti semi-joins if you have proper foreign keys enabled you can take a short cut and simply search for NULL where the EmployerID would be (assuming you allow NULLS in your database). This is because with proper foreign keys we can ensure that our EmployerID values have to exist in the Employer table.  Otherwise there is a risk of having a value in EmployerID that doesn’t actually exist in People.

SELECT P.Name, P.Job, P.EmployerID
FROM People P WHERE EmployerID IS NULL

However this only works when doing a anti semi-join on People.

Now lets turn this around and do an anti semi-join on Employer

SELECT E.ID,E.Name,E.Address
FROM Employers E
WHERE NOT EXISTS (SELECT 1 FROM People P WHERE P.EmployerID = E.ID)

Here are just a few of the 10000 rows.

And the query plan gives us a right anti semi-join.

Using NOT IN here causes us to run into the same problem as before. Since there is a NULL value in the EmployerID column of the People table we would be comparing real values to NULL.  We could qualify the sub-query though to exclude NULLS.

SELECT E.ID,E.Name,E.Address
FROM Employers E
WHERE E.ID NOT IN
(SELECT P.EmployerID FROM People P WHERE P.EmployerID IS NOT NULL)

That’s too much work though when there are better options but it is good to have it in your pocket.  You never know when you might need some trick you’ve learned.

Aaron Bertrand did a post for T-SQL Tuesday that looks at the a few other options and their performance.  It is worth a read.