Statistics (Part 6) – Computed Columns

This topic is unbelievable dry. Dryer than ancient Egyptian pharaoh bone dust dry. Dryer than the surface of the sun dry. The layman might say “you’re talking about SQL Server and you expect it to be interesting?” Fair point. Maybe the topic of SQL Server is dry to the laymen but almost every topic is interesting to someone. Some people collect thimbles or bugs. Some people juggle geese. And as interested in SQL Server as I am this particular aspect of stats has got to be the driest I’ve ever written about. Death Valley dry. Dryer than Aunt Edna’s thanksgiving turkey dry.

However I feel it is necessary to include this information as part of the series. Please pardon me if I try to spice it up a bit. I already shorted out one keyboard from the boredom induced drool. This builds on my previous and hopefully less boring series on stats.

So…stats on computed columns. You can do this. Even if the column isn’t persisted. There is a caveat though. The formula for the column must be deterministic.

What does it mean to be deterministic? Wikipedia defines determinism as “a metaphysical philosophy stating that for everything that happens there are conditions such that, given those conditions, nothing else could happen.” In computer science this means that given the same input you will always have the same output. In a deterministic universe if we went back 14 billion years and reran the big bang I’d still be typing this boring topic and you would have still stopped reading this two paragraphs ago (see what I did there?)

It would be impossible to generate statistics on a column that is created from a non-deterministic expression. Imagine a column that was based on the result of RAND(). Every select would produce a new value on each row. Stats would be wrong the minute they are created. So SQL Server won’t let you create stats on non-deterministic computed columns.

However given the predictability of determinism (this is actually the third time you’ve read this in the last 42 billion years you just don’t remember the first two times) SQL Server is easily able to create statistics on a computed column, even if the computed column isn’t persisted. If the computed column C is the sum of column A and column B then the computed column will always be A + B even if C is never saved on disk.

According to the documentation the optimizer can make estimates even if the predicate contains the formula the computed column is calculated from instead of the column name. I will test that too. Fun.

So here is the table I created for this test and the batch I used to fill it with approximately 13,500 rows of data:

CREATE TABLE dbo.ComputedColumns
(
    ID int IDENTITY(1,1) NOT NULL,
    Col1 int NOT NULL,
    Col2 int NOT NULL,
    Col3 varchar(10) NOT NULL,
    Comp1 AS (ISNULL(Col1,0) + ISNULL(Col2,0)),
    Comp2 AS Col3 + ' ' + CONVERT(varchar(10),GETDATE(),114),
    Comp3 AS (Col1 + Col2) * .5,
    Comp4 AS Col1 * Col2,
)

GO

--load up some data
DECLARE @i int
DECLARE @Col3 varchar(10)

SET @i = 0

WHILE @i < 13456
BEGIN
SELECT @Col3 = (CHAR(FLOOR((RAND() * 26.0)) + 65)
              + CHAR(FLOOR((RAND() * 26.0)) + 65)
              + CHAR(FLOOR((RAND() * 26.0)) + 65)
              + CHAR(FLOOR((RAND() * 26.0)) + 65)
              + CHAR(FLOOR((RAND() * 26.0)) + 65)
              + CHAR(FLOOR((RAND() * 26.0)) + 65)
              + CHAR(FLOOR((RAND() * 26.0)) + 65)
              + CHAR(FLOOR((RAND() * 26.0)) + 65)
              + CHAR(FLOOR((RAND() * 26.0)) + 65)
              + CHAR(FLOOR((RAND() * 26.0)) + 65))

INSERT INTO dbo.ComputedColumns(Col1,Col2,Col3)
VALUES (RAND() * 10000, RAND() * 10000,@Col3)

SET @i = @i + 1
END

GO

Now to create some stats.

CREATE STATISTICS Comp1Stat ON dbo.ComputedColumns(Comp1)
CREATE STATISTICS Comp2Stat ON dbo.ComputedColumns(Comp2)
CREATE STATISTICS Comp3Stat ON dbo.ComputedColumns(Comp3)
CREATE STATISTICS Comp4Stat ON dbo.ComputedColumns(Comp4)

If you actually try to create the stats on Comp2 you should get an error stating:
Msg 2729, Level 16, State 1, Line 1
Column ‘Comp2’ in table ‘dbo.ComputedColumns’ cannot be used in an index or statistics or as a partition key because it is non-deterministic.

So SQL Server won’t let us create stats on a non-deterministic computed column. That is good to know.

Let’s take a look at the stats for the fourth computed column, Comp4. Yes, normally things like this start with 1 and go up but we’re trying to be spicy and interesting here, remember?

DBCC SHOW_STATISTICS('dbo.ComputedColumns','Comp4Stat')

If the predicate contains a constraint of 729362 against column Comp4 we should get an estimate of two rows.

SELECT * FROM dbo.ComputedColumns WHERE Comp4 = 729362


(Notice that I didn’t trim this screen shot well and there is some bleed through from the background. More spice!)

According to the documentation we should also be able to use the formula for the column in the predicate, like this:

SELECT * FROM dbo.ComputedColumns WHERE Col1 * Col2 = 729362

And it is 2 (yawn).

As an aside you might want to notice that the optimizer appears to have converted the query against Comp4 to the formula. The properties on the Table Scan operator in both examples show the predicate is the column formula and not the column name.

So all is well and good so far. Let’s query against Comp1. First let’s look at the stats.

DBCC SHOW_STATISTICS('dbo.ComputedColumns','Comp1Stat')

A query against the column Comp1 of 1937 should give us an estimate of 3 rows

SELECT * FROM dbo.ComputedColumns WHERE Comp1 = 1937

But it isn’t 3 rows. It is 1249.36? Where does this number come from? I explain that in a future blog post. The question for now is “why wasn’t the cardinality estimate 3 rows?” At this point you might wonder if the CONVERT_IMPLICIT has anything to do with it. It doesn’t. I manually converted it and the estimate is the same. It also doesn’t help to use the formula in the predicate either.

Let’s move on to computed column Comp3 and see what happens when I run a query against that column. First the stats.

DBCC SHOW_STATISTICS('dbo.ComputedColumns','Comp3Stat')


Then the query.

SELECT * FROM dbo.ComputedColumns WHERE Comp3 = 968.5

This appears to have worked.

And we should also be able to use the definition of the computed column in the predicate rather than the actual column name.

SELECT * FROM dbo.ComputedColumns WHERE (Col1 + Col2) * .5 = 968.5


Hmm, that doesn’t seem to work. We are getting that number, 1249.36, again.

If you look closely at the Table Scan properties you see that the text of the formula in the WHERE clause has been parameterized. There is an @1 in (Col1 + Col2) * .5 where the .5 should be making the formula (Col1 + Col2) * @1. Since I don’t have forced parameterization turned on in this database it must be using simple parameterization. By adding 1=1 to the predicate I can cause it to not use simple parameterization.

SELECT * FROM dbo.ComputedColumns WHERE 1=1 and (Col1 + Col2) * .5 = 968.5

It appears that the parameterization caused the estimate to be wrong. This fits with the documentation which says that if using the column formula in the predicate the definition in the predicate has to match the definition for the column. The predicate did not match the column after it was parameterized. This seems like a bug to me.

Now I’m curious of forced parameterization will parameterize the formula in the predicate. I’ll switch my database to forced parameterization and run the query again.

ALTER DATABASE TestDB SET PARAMETERIZATION FORCED
GO
SELECT * FROM dbo.ComputedColumns WHERE (Col1 + Col2) * .5 = 968.5
GO


It appears that forced parameterization does not result in the same cardinality estimate problem as simple parameterization does.

This doesn’t explain why Comp1 doesn’t use the stats and I have no good explanation either. It probably isn’t because SQL Server believes Comp1 to not be deterministic. We’ve already seen that stats won’t be created on a computed column whose definition is not deterministic. My intent though wasn’t to explicate everything there is to know about statistics with computed columns but to show that cardinality estimates on computed columns do work, why they sometimes don’t work (parameterization) and that sometimes they don’t work at all.

Hopefully you found this information helpful. In the next post we look at where the very odd estimate of 1249.36 came from. Right now though I’m going to find a giant humidor to climb into now to recover from the dryest topic I hope I ever have to write about. Next post will be about how optimizer came up with an estiamte of 1249.36 out of nowhere. Hint: it pulled it from its nether regions.

3 thoughts on “Statistics (Part 6) – Computed Columns

  1. Great series! Not bored yet :-)

    One tiny correction. Instead of “It would be impossible to generate statistics on a column that is created from a non-deterministic equation …” I would read “It would be impossible to generate statistics on a column that is created from a non-deterministic FUNCtion (or EXPRESSion)”.

    Keep up the awesome posts and, maybe, someday, I’ll finally master SQL enough to get bored :-)

  2. replacing Comp1 AS (ISNULL(Col1,0) + ISNULL(Col2,0)) with Col1+Col2 would give you correct results.

Comments are closed.