This is a continuation of my blog series about statistics. The previous post is here.

So the question is: what happens to your cardinality estimates when the column is nullable and contains many nulls (or is it more accurate to say doesn’t contain many values? Regardless, I think you know what I mean)?

Under this condition the density in the density vector is not used at face value; it is adjusted to account for the presence of nulls in the column.

In part 2 of this series I created a table called AWStatTest which contains a subset of the columns from the Person.Address table in the AdventureWorks database: AddressLine1, AddressLine2, City, and PostalCode.

The column AddressLine2 is nullable and, given the nature of addresses, the second address line doesn’t always contain data. This is the column I will use for my test.

The first step is to create a statistics object on this column, which I will do with this statement. I got into the habit of using full scans in my examples because it is often easier to understand how cardinality estimation works when the numbers predicted by the stats match the actual numbers in the table. This isn’t a blanket recommendation for full scans, though.
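The original statement isn’t reproduced here, but it would have been something along these lines (the statistic name StatAddressLine2 is my assumption):

```sql
-- Create a single-column statistic on AddressLine2, sampling every row
CREATE STATISTICS StatAddressLine2
ON dbo.AWStatTest (AddressLine2)
WITH FULLSCAN;
```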

And to see the stats it produced.
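The stats can be viewed with DBCC SHOW_STATISTICS, using whatever name the statistic was created with (StatAddressLine2 here is an assumption):

```sql
-- Returns the header, density vector, and histogram for the statistic
DBCC SHOW_STATISTICS ('dbo.AWStatTest', StatAddressLine2);
```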

Recall from part 2 of this series that density is calculated by the formula 1/(number of unique values). The number of unique values in the column can be found with this statement.

SELECT COUNT(*)
FROM
(
SELECT DISTINCT AddressLine2 FROM AWStatTest
) A

In our case the value returned is 39. So, 1/39 = .02564103 which is the density shown in the density vector above.

In order to test cardinality estimation under these conditions I will need to use a predicate which causes the optimizer to use the density value for estimation instead of the histogram. Comparing against a variable instead of a static value will force this.

DECLARE @AddressLine2 nvarchar(60)
SET @AddressLine2 = '# 10'

SELECT * FROM dbo.AWStatTest WHERE AddressLine2 = @AddressLine2

Based on the knowledge gained in previous posts I would expect the cardinality estimate to be calculated from the density vector using the formula (Number of rows in the table) * (Density), or in our case 4040008 * .02564103, which would be about 103590 rows.

But the estimate produced by the optimizer is quite different: 1275.79

My next guess would be that the optimizer is using a magic density. But that formula (4040008^.75) produces an estimate of 90112.84, still way off from the estimate the optimizer produced.

If the database option ANSI_NULLS is ON then a null comparison via an equality statement will never resolve to true. Even a comparison of NULL = NULL does not resolve to true. Given this, the only way our predicate AddressLine2 = @AddressLine2 can be true is if the value of @AddressLine2 is not null and the value of this field in the row is also not null. In other words any row which has AddressLine2 of null would never be returned by this statement even if the value of @AddressLine2 is null.
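A quick throwaway demo (not from the original post) makes the behavior easy to see:

```sql
SET ANSI_NULLS ON;
GO
-- NULL = NULL evaluates to unknown, so the ELSE branch runs
IF NULL = NULL PRINT 'equal' ELSE PRINT 'not equal';
GO
SET ANSI_NULLS OFF;
GO
-- With ANSI_NULLS OFF, NULL = NULL evaluates to true
IF NULL = NULL PRINT 'equal' ELSE PRINT 'not equal';
GO
SET ANSI_NULLS ON;  -- restore the default
```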

So, what if we recalculated density and cardinality but disregarded nulls? There would be only 38 unique values in the column instead of 39, so the density would be 1/38, or .02631578. The histogram shows there are 3991528 null values in AddressLine2. The total number of rows in the table minus the number of null entries gives us the number of non-null values in the column: 4040008 – 3991528 = 48480.
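These numbers can be checked directly against the table (the values in the comments are the ones from this post):

```sql
-- COUNT(DISTINCT col) and COUNT(col) both ignore nulls
SELECT COUNT(DISTINCT AddressLine2) AS NonNullDistinctValues, -- 38
       COUNT(*) - COUNT(AddressLine2) AS NullRows,            -- 3991528
       COUNT(AddressLine2) AS NonNullRows                     -- 48480
FROM dbo.AWStatTest;
```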

Using the new density and table row count to calculate cardinality would give us,

48480 * .02631578 = 1275.79, the same cardinality estimate produced by the optimizer.
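The two calculations side by side, using the density values from above (just arithmetic, runnable anywhere):

```sql
SELECT 48480 * 0.02631578 AS EstimateWithoutNulls,  -- ~1275.79
       4040008 * 0.02564103 AS EstimateWithNulls;   -- ~103590
```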

This is with ANSI_NULLS turned on. What if ANSI_NULLS is disabled? With ANSI_NULLS OFF, a comparison of NULL = NULL resolves to true, and all of the nulls in the column could be returned when using the equality predicate construct.

Running the same query again with ANSI_NULLS off produces a cardinality estimate of 103590 which is the estimate we came up with using the original density and full table row count.
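For reference, the rerun is the same query with the setting flipped first:

```sql
SET ANSI_NULLS OFF;
GO
DECLARE @AddressLine2 nvarchar(60);
SET @AddressLine2 = '# 10';
SELECT * FROM dbo.AWStatTest WHERE AddressLine2 = @AddressLine2;
GO
SET ANSI_NULLS ON;  -- restore the default
```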

Neat, huh?

Very neat . . . seems like a good scenario for creating a filtered statistic. Another option would be setting ANSI_NULLS off, but that seems like overkill given that we now have filtered statistics.
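For anyone curious, a filtered statistic along the lines the commenter suggests would look something like this (the statistic name is made up):

```sql
-- Statistic built only over the non-null rows of AddressLine2
CREATE STATISTICS StatAddressLine2_NotNull
ON dbo.AWStatTest (AddressLine2)
WHERE AddressLine2 IS NOT NULL
WITH FULLSCAN;
```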

Your thoughts?

Bill, Sorry it took so long for me to respond. WordPress didn’t email me that I had a new comment.

I don’t think the behavior I described in this post, by itself, warrants any action. The overall goal of this series is to understand why SQL Server comes up with the estimates it does. In this case I like the behavior. If I’m using an equality constraint I don’t want the nulls to be considered in the estimate. This should give us a better cardinality estimate and, in theory, a better query plan.

A statistic filtering out the nulls would remove them from the estimate, which is what SQL Server is already doing for us. So I don’t know how that would benefit us here. SET ANSI_NULLS OFF has been deprecated since 2005 so its use should be avoided.