# Statistics (Part 5) – String Index

I’ve been talking about different aspects of stats and cardinality estimates. My last post is hiding here.

The String Index portion of a statistics object is sparsely documented. The existence of the string index in a stats object can be discovered by looking at the header output of DBCC SHOW_STATISTICS.

The documentation from Microsoft often refers to this as the String Summary.

The string index assists in cardinality estimates for substrings in predicates containing the LIKE clause with wildcard characters. It only does this for the leading column of a stats object and only for character data types, i.e. the string index is constructed when the leading column of a stats object is a character column.

The index contains substrings from the first and last 40 characters of a string. If the string is 80 characters long then the string index is created for all characters. If the string is 120 characters long the middle 40 characters would not be included in the string index.

To show this I created a test table with column that is 120 characters long. This column can logically be divided into three sections, the first 40 characters, the middle 40 characters, and the last 40 characters. In each of these sections I create a substring that is unique to that section and does not exist in the other sections. By querying the table with LIKE and wildcards I can see if the cardinality estimates for the middle section are different from the estimates for the beginning and ending sections. In theory the cardinality estimate for the middle section should be the same for predicate whose substring is not part of the string index. I also include several thousand rows that have no noteable substrings at all just to give the table soem mass.

CREATE TABLE StringTest (ID int IDENTITY(1,1) PRIMARY KEY NOT NULL, Col1 varchar(120))
INSERT INTO StringTest (Col1) SELECT REPLICATE('A',120)
GO 1000
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',10) +
REPLICATE('B',10) +
REPLICATE('A',100)
GO 30
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',10) +
REPLICATE('C',10) +
REPLICATE('A',100)
GO 30
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',120)
GO 1000
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',50) +
REPLICATE('D',10) +
REPLICATE('A',60)
GO 30
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',50) +
REPLICATE('E',10) +
REPLICATE('A',60)
GO 30
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',120)
GO 1000
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',90) +
REPLICATE('F',10) +
REPLICATE('A',20)
GO 30
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',90) +
REPLICATE('G',10) +
REPLICATE('A',20)
GO 30
INSERT INTO StringTest (Col1)
SELECT REPLICATE('A',120)
GO 1000

CREATE STATISTICS StringTestStat ON dbo.StringTest(Col1) WITH FULLSCAN
DBCC SHOW_STATISTICS('StringTest','StringTestStat')

In the first section there are 30 rows with the substrings ‘BBBBBBBBBB’ and ‘CCCCCCCCCC’, in the second section there are 30 rows with the substrings ‘DDDDDDDDDD’ and ‘EEEEEEEEEE’, and in the last section there are 30 rows with the substring ‘FFFFFFFFFF’ and ‘GGGGGGGGGG’. What we should see are cardinality estimates for the B string and C string similar to the cardinality estimates for the F string and the G string (giggle). But the cardinality estimates for the D and E strings should be similar to cardinality estimates for a substring that doesn’t exist.

Substrings in the first 40 characters:

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%BBBBBBBBBB%'

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%CCCCCCCCCC%'

Substring in the middle 40 characters:

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%DDDDDDDDDD%'

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%EEEEEEEEEE%'

Substrings in the last 40 characters:

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%FFFFFFFFFF%'

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%GGGGGGGGGG%'

Substring that doesn’t exist:

SELECT * FROM dbo.StringTest WHERE Col1 LIKE '%SSSSSSSSSS%'

As predicted the estimates for the substrings in the first and last sections, the B and C substrings are similar to the estimates for the substrings in the last section, the F and G substrings. However the estimates for the middle section, the D and E substrings are similar to the estimates for substrings that do not exist in the column.

Given that there is so little documentation on this topic this is the best I can do to show how the string index can influence cardinality estimates.

My next post will be about statistics on computed columns.