Automatically Created Statistics - know the caveats

In the realm of data management, maintaining optimal query performance is paramount. Microsoft SQL Server provides a powerful feature—automatically generated statistics objects—that plays a crucial role in this endeavor. These statistics objects are vital for the Query Optimizer to make decisions about the most efficient execution plans, resulting in faster query performance and reduced resource consumption.

Inhaltsverzeichnis

What are automatically created statistics?
- Benefits
When they are created
Auto created statistics in large tables
Redundant Statistics

What are automatically created statistics?

Automatically generated statistics objects are created by SQL Server to gather metadata about the distribution of data within columns of tables. This metadata helps the query optimizer estimate the number of rows that will be returned by a query, leading to more accurate and efficient execution plans.

Automatically generated statistics differ from statistics generated manually or when creating indexes. You can usually identify them by the fact that they start with the prefix „_WA_sys„.

Benefits

Reduced manual effort because statistics don’t need to be created by the developers/database administrators
The Query Optimizer can choose the most efficient execution plan

When they are created

When a query is executed the first time and the Query Optimizer determines that statistics are needed for a column.

Auto created statistics in large tables

Statistics in Microsoft SQL Server are automatically created under the following conditions:

When a query references a column without existing statistics, SQL Server may automatically create statistics to help the query optimizer make better decisions.
If a query involves joins, filters, groupings, or other constructs that require data distribution information, and statistics are not available, SQL Server will create them.

These automatically generated statistics help ensure that the query optimizer has accurate and current information to produce efficient execution plans, leading to better query performance and resource utilization. That sounds cool. But it has a significant disadvantage when accessing a table object with a very large amount of data for the first time: first the statistics are created and then the query can be executed. And that can take some time. I use my demo database, which I use for my workshops and my conference sessions, as an example.

USE ERP_Demo;
GO

/* Schema in ERP_Demo for the blog post about auto created statistics */
CREATE SCHEMA [demo] AUTHORIZATION dbo;
GO

DROP TABLE IF EXISTS demo.lineitems;
GO

/* Create a demo table with 82.831.637 rows. */
SELECT	*
INTO	demo.lineitems
FROM	dbo.lineitems WITH (READPAST);
GO

/*
    Start the DISTINCT query to create a statistics object for every column
    Note: The DISTINCT forces SQL Server to create a stats object for every
          output column!
*/
SELECT	DISTINCT
		*
FROM	demo.lineitems
GO

This depiction shows an extract of the extended event session which recorded all activities when an object gets created / altered or dropped.
It shows for every column a create entry for the new statistics object. — extended event recording of the query

I have shortened the output of the extended event for better readability. However, the figure shows two things very clearly:

Creating the statistics took about 10 seconds.
After the statistics were created, the query was started

/* Get a list of all stats objects for the demo table */
SELECT	object_id,
		stats_id,
		stats_name,
		auto_created,
		stat_columns
/*
    The function is part of the framework of the demo database!
    GIST: https://gist.github.com/uricken1964/b3b31b33dadcbc3252ba2488d3c6961c
*/
FROM	dbo.get_statistics_columns_info(N'demo.lineitems', N'U');
GO

To list the statistics objects, use a function I wrote that is available in the framework of my demo database or can be loaded from my GIST repositories.

The red block represents the ordinal value of the attribute in the table as a hexadecimal value. Even if the example looks a bit „contrived“, it is clear that creating statistics takes time. The time will increase as there are more rows and columns in the table.

Note: „There is no lunch for free!“

The creation of automatic statistics can be avoided in DISTINCT queries by giving the table a unique key before the execution of the query. The Key Attribute must be included in the SELECT statement.

DROP TABLE IF EXISTS demo.lineitems;
GO

/* Create a demo table with 82.831.637 rows. */
SELECT	*
INTO	demo.lineitems
FROM	dbo.lineitems WITH (READPAST)
WHERE	1 = 0;
GO

ALTER TABLE demo.lineitems ADD CONSTRAINT pk_demo_lineitems PRIMARY KEY NONCLUSTERED
(
	l_orderkey,
	l_linenumber
);

INSERT INTO demo.lineitems WITH (TABLOCK)
SELECT * FROM dbo.lineitems WITH (READPAST);
GO


SELECT	DISTINCT
		*
FROM	demo.lineitems
GO

/* Get a list of all stats objects for the demo table */
SELECT	object_id,
		stats_id,
		stats_name,
		auto_created,
		stat_columns
FROM	dbo.get_statistics_columns_info(N'demo.lineitems', N'U');
GO

The presence of a unique key is sufficient for Microsoft SQL Server to determine that the index/constraint ensures that each row is unique.
This works for the following implementations:

PRIMARY KEY Constraint
UNIQUE Constraint
UNIQUE CLUSTERED INDEX
UNIQUE NONCLUSTERED INDEX

Redundant Statistics

Another unpleasant feature of automatic statistics is the possible redundancy of statistics objects when – for example – indexes or manual statistics objects are created. Microsoft SQL Server does not check whether a statistics object with the same definition already exists.

DROP TABLE IF EXISTS demo.customers;
GO

/* Insert 1.600.000 rows into the table */
SELECT *
INTO demo.customers
FROM dbo.customers WITH (READPAST)
GO

/* Execute the query to create the automatic statistics */
SELECT	DISTINCT
		*
FROM	demo.customers
GO

ALTER TABLE demo.customers ADD CONSTRAINT pk_demo_customers PRIMARY KEY CLUSTERED
(c_custkey);

/*
	Get a list of all stats objects for the demo table
	Note: Statistics 1 and 2 are identically!	
*/
SELECT	object_id,
		stats_id,
		stats_name,
		auto_created,
		stat_columns
FROM	dbo.get_statistics_columns_info(N'demo.customers', N'U');
GO

The above example creates the table [dbo].[customers] and immediately performs a SELECT. This creates the automatic statistics. Only then is a primary key created.

Pay attention to the two statistics objects that reference the attribute [c_custkey]! Identical statistics have no direct impact on the performance of a query, as the query optimizer recognizes the redundancy and always chooses the index statistics. However, two points should not be overlooked:

Maintenance tasks require more time to update all statistics!
Both statistics objects must be updated

/* Show the execution time for the update of statistics */
DBCC TRACEON (3604, 8721);
GO

/* Úpdate all statistics objects with FULLSCAN */
UPDATE STATISTICS demo.customers WITH FULLSCAN;
GO

The figure shows the elapsed time for updating each statistics object. In total, it takes about 14 seconds to update the statistics. — Two different statistics objects but same data!

The update of all statistics objects took app. 14 seconds and this table has only 1.6 Mio rows. Think big! What happens with your maintenance in a multi-billion table when the maintenance will update all modified statistics with a full scan?

Thank you very much for reading!

Automatically Created Statistics – know the caveats

What are automatically created statistics?

Benefits

When they are created

Auto created statistics in large tables

Redundant Statistics

Kommentar absenden Antworten abbrechen

db SQL Blog