The danger of NOT IN (Or, you keep using that word. I do not think it means what you think it means.)

I hinted about this in my post on anti-joins but I thought it deserved a little more attention, especially since this SQL construct is one I’ve seen used many times and, in the past, I used it quite liberally. The danger is that when doing an anti-join with NOT IN there is a risk that SQL will give you unexpected results.

First it helps to understand what an anti-join is. Basically you are telling SQL server to give you all of the rows from one result set that don’t have a match in another result set. My previous post has more details.

However I’m not going to start with an anti-join example but with an example of IN and NOT IN because this makes the anti-join example easier to understand.

First I need to create some data to work with.

CREATE TABLE dbo.Person
(
Name varchar(20),
AddressID int
)

CREATE TABLE dbo.Address
(
AddressID int,
Address varchar(20)
)

INSERT INTO dbo.Person (Name,AddressID) VALUES ('Wesley',1)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Buttercup',1)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Inigo',2)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Fezzik',2)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Rugen',3)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Humperdinck',3)
INSERT INTO dbo.Person (Name,AddressID) VALUES ('Vizzini',4)

INSERT INTO dbo.Address (AddressID,Address) VALUES (1,'Farm')
INSERT INTO dbo.Address (AddressID,Address) VALUES (2,'Theives'' Forest ')
INSERT INTO dbo.Address (AddressID,Address) VALUES (3,'Castle')
INSERT INTO dbo.Address (AddressID,Address) VALUES (null,'Dungeon')

The two tables look like this:
Person:

Address:

The Person table has 7 rows.  AddressID contains the values 1,2,3,4.
The Address table has 4 rows. AddressID contains the values 1,2,3, and for one of the rows this value is null.

Yes, typically you wouldn’t have an AddressID of null but this is contrived setup created for this example.

The first example is a semi-join to find all of the rows in Person with an AddressID that have a match on Address.AddressID. By eye it is very easy to spot the answer, the rows in Person that contains the values 1,2 or 3 in the AddressID.  Of course in our application we can’t do this by eye and we must do this programmatically.  Instead of doing a subquery in the IN clause I’m going to simply substitute the values that the subquery would return.

--So instead of this:
SELECT Name,AddressID
FROM dbo.Person
WHERE AddressID IN (SELECT AddressID FROM dbo.Address)

--I’m going to use this but without the null:
SELECT Name,AddressID
FROM dbo.Person
WHERE AddressID IN (1,2,3)

The result:

If we look at the execution plan for this we see that there is a table scan on the Person table with a predicate that looks something like this.

[dbo].[Person].[AddressID]=(1) OR
[dbo].[Person].[AddressID]=(2) OR
[dbo].[Person].[AddressID]=(3)

The OR is important here. If the value of AddressID matches either 1, 2 or 3 then this will return true and the row will be a match.

Let’s convert this to a NOT IN.

SELECT Name,AddressID
FROM dbo.Person
WHERE AddressID NOT IN (1,2,3)

The result:

The query plan also has a table scan on the Person table but now the predicate is this:

[dbo].[Person].[AddressID]<>(1) AND
[dbo].[Person].[AddressID]<>(2) AND
[dbo].[Person].[AddressID]<>(3)

Notice that now we are looking for inequality and what was an OR is now an AND. This means that for each AddressID in the Person table it has to not be equal to 1,2 and 3 for that row to be returned by the query. If any one of these comparisons doesn’t return true (notice I didn’t say returns false) then the entire statement is not true and there will be no match.

Suppose we change the NOT IN clause to (1,2,3,null). How would that change the predicate?
It would look something like this

[dbo].[Person].[AddressID]<>(1) AND
[dbo].[Person].[AddressID]<>(2) AND
[dbo].[Person].[AddressID]<>(3) AND
[dbo].[Person].[AddressID]<>(null)

What happens when you compare anything to null? The result of the forth line is neither true nor false but unknown. What happens with the 4th line returns unknown? Since we are using AND and not OR all lines must return true for the entire statement to be true. Since the forth line evaluates to unknown this entire statement would evaluate to not true for each AddressID in the Person table and no rows from the query would be returned.

This query:

SELECT Name,AddressID
FROM dbo.Person
WHERE AddressID NOT IN (1,2,3,null)

Returns this:

If you look at the actual execution plan generated by this query, the predicate that I crafted above that includes the null comparison isn’t actually in the query plan. My suspicion is that once the optimizer sees the null value it no longer has to do the comparison because it knows what the answer will be and can cut to the chase.

Let’s move this to the anti-join example using the sample data I created above.

SELECT Name,AddressID
FROM dbo.Person P
WHERE P.AddressID NOT IN (SELECT A.AddressID FROM dbo.Address A)

Since we know what the data looks like we know this will return no rows. However let’s look at the query plan produced and see what it does.

This is a nested loop join. Let’s remember what the nested loop is doing. It is getting each row from the Person table and comparing it to each row in the Address table one by one. This comparison is using a certain predicate on the nested loop Join operator (not on the table itself) to do this comparison. In this case the predicate on the nested loop is this:

[dbo].[Person].[AddressID] IS NULL OR
[dbo].[Address].[AddressID] IS NULL OR [dbo].[Person].[AddressID]=[AutoParam].[dbo].[Address].[AddressID]

If any of these are true then that row violates the NOT IN condition. But, you say, that doesn’t look like what we had before.

Reminder of what we had before
[dbo].[Person].[AddressID]<>(1) AND
[dbo].[Person].[AddressID]<>(2) AND
[dbo].[Person].[AddressID]<>(3) AND
[dbo].[Person].[AddressID]<>(null)

Logically they are doing the same thing. In the hardcoded example the AddressID in every row was compared to the value of 1,2,3 or null and had to NOT match all of them. Had any one of them matched (or been compared to null) then the entire statement is not true. Instead of looking to see if it doesn’t match all of them we can also look to see if one of the AddressIDs does match one of them. If it does match one of them then we know that AddressID in question is in the Address table and that row should be excluded from the anti-join result set. We also know that if it is compared against null the entire result set will be null. This simplified construct works because each row in the Person table is being compared to each row in the Address table, one by one, i.e the query processor doesn’t know what the future values to compare will be until the nested loop evaluates that row. Thus comparing to all values at once doesn’t work here.

While this scenario may appear contrived (since the apparent identity column on the address table contains a null) in real life I’ve see this happen when the subquery contains a derived table which is composed of two tables joined with an outer join. The outer join can cause some values to return null, which we now know will cause our NOT IN to puke if we point our NOT IN clause at one of the columns containing null values.

So how do we do this safely? One method is to use NOT EXISTS. The query above rewritten using NOT EXISTS looks like this:

SELECT Name,AddressID
FROM dbo.Person P
WHERE NOT EXISTS (SELECT AddressID FROM dbo.Address A WHERE A.AddressID = P.AddressID)

Which returns this: