Tuesday, August 6, 2013

All about Statistics [ Part – I ]

Statistics


Introduction

To improve the performance of any query the query optimizer uses the statistics to create query plan.  In most of the case the query optimizer generate the necessary statistics for high quality query plan. In few cases we need to create the additional statistics or modify the query design.

What is Statistics
Statistics for query optimization are objects that contain statistical information about the distribution of values in one or more columns of a table or indexed view.

Query optimizer uses this statistical information to estimate the number of rows, in the query result set and by this way the optimizer to create a high-quality query plan.

Query optimizer uses this cardinality estimation to choose the index seek operator instead of the index scan operator, and in doing so improve query performance.

How we can Update Statistics

We can update statistics by using UPDATE STATISTICS or sp_updatestats
But Microsoft recommended that to keeping AUTO_UPDATE_STATISTICS set to ON so that the query optimizer continues to routinely update statistics.

To determine when the statistics where last update use the 
STATS_DATE ( object_id , stats_id )

Example

SELECT name AS index_name,
    STATS_DATE(object_id, index_id) AS statistics_update_date
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.tbl_CUSTOMER');
GO

index_name                                                   statistics_update_date
PK__tbl_CUST__3D9DDA317BE3C753     2014-04-04 17:31:24.810

When to Update Statistics

We have to update the statistics in the following condition

·         Query execution times are slow.
·         Insert operations occur on ascending or descending key columns.
·         After maintenance operations.
Create Statistics
In MS SQL Server the statistics can be created by using CREATE STATICSTIC command or by CREATE INDEX command.

The statistical information created using CREATE STATISTICS command is equivalent to the statistics built by a CREATE INDEX command on the same columns.

The only difference is that the CREATE STATISTICS command uses sampling by default while the CREATE INDEX command gathers the statistics with full scan since it has to process all rows for the index anyway.

Example

CREATE STATISTICS [IX_STATS_CUSTID]
ON [dbo].[tbl_CUSTOMER]([CUSTID])
WITH SAMPLE 50 PERCENT;

As we are here using 50% sampling of the row, for any big table random sampling may not produce the actual statistics. Therefore, for bigger tables, we may need to use the resample option on UPDATE STATISTICS. The resample option will maintain the full scan statistics for the indexes and sample statistics for the rest of the columns.

The statistical information is updated when 20% of the rows changed.

Update Statistics
Updating of statistics ensures that any query that runs get the up-to-date statistics to satisfy the query needs. This is introduced in MS SQL 2005 version and it is similar to thesp_updatestats command.

UPDATE STATISTICS [dbo].[tbl_CUSTOMER]
WITH FULLSCAN, ALL
GO

All about Statistics [ Part – I ]

Introduction
I got a query from one of my friends. He is asking about when to use sp_updatestats and how frequently we can use it. To answer his query, I decide to write some words related to Statistics. Hope it will be informative for all of us.

Use of Statistics Objects
We all know about the important of statistics object to make efficient execution plan by optimizer. If the statistics object is not updated properly causes poor execution plan and down the performance factors of query.

When the Statistics Object Created
When we are creating the Index the Statistics object is created automatically called the Index Statistics. These statistics will exist as long as the index exists.

Second, assuming the database option Auto Create Statistics is enabled, which it is by default, SQL Server will create single-column statistics whenever a column, which is not already the leading column in an existing index, is used in a query predicate (e.g. in the search condition of a WHERE clause, or in a JOIN condition). We refer to these as Column Statistics. We can also use the CREATE STATISTICS command to create single- and multi-column statistics manually.

When the Statistics Object Updated
When we Insert/Update/Delete records from Table objects the SQL Server automatically Insert/Update/Delete corresponding rows in the Index.

 But the Statistics is not updated like this, here we mean to say that when we Insert/Update/Delete the records from Table objects the Index updated automatically but Statistics not. Although the Auto Update Statistics option of the Database is enabled.
So we have to understand when the statistics is updated. It depends on certain volume threshold.

As data changes in our tables, the statistics - all the statistics - will be updated based on the following formula:

·         When a table with no rows gets a row
·         When 500 rows are changed to a table that is less than 500 rows
·         When 20% + 500 are changed in a table greater than 500 rows

Every time we modify a record in a table, SQL Server tracks it via the rcmodified column in a hidden system table. SQL Server 2005 tracked this information in the sys.rowsetcolumns table, In SQL Server 2008 (and later) sys.rowsetcolumns merged with sys.syshobtcolumns and became sys.sysrscols.

When we create or rebuild (not reorganize, just rebuild) an index, SQL Server generates the statistics with a FULLSCAN, i.e. it scans all the rows in the table to create a histogram that represents the distribution of data in the leading column of the index. Likewise, SQL will auto-create column statistics with a full sample.

When to Update Statistics Manually

Suppose we have table objects of 100 millions of records and SQL server is going to Update the Statistics Objects when 20% of the records of 100 million is effected by Insert/Update/Delete. So we have to wait long for Statistics Object update and result is poor exaction plan creation by SQL server. In this situation we have to update the statistics object manually.

In cases where we know data distribution in a column is "skewed", it may be necessary to update statistics manually with a full sample, or create a set of filtered statistics, in order to generate query plans of good quality.

Examining the Statistics
                            
sp_helpstats 'tbl_CUSTOMERDTLS', 'ALL'

statistics_name                     statistics_keys
---------------                   ----------------
_WA_Sys_00000001_03BB8E22           CUSTID
_WA_Sys_00000002_03BB8E22           CUSTNAME
_WA_Sys_00000003_03BB8E22           TOTALSALES
_WA_Sys_00000004_03BB8E22           GRADE
IX_CUSTID_tbl_CUSTOMERDTLS          CUSTID

The better approach

SELECT  [sch].[name] + '.' + [so].[name] AS [TableName] ,
        [si].[index_id] AS [Index ID] ,
        [ss].[name] AS [Statistic] ,
        STUFF(( SELECT  ', ' + [c].[name]
                FROM    [sys].[stats_columns] [sc]
                        JOIN [sys].[columns] [c]
                         ON [c].[column_id] = [sc].[column_id]
                            AND [c].[object_id] = [sc].[OBJECT_ID]
                WHERE   [sc].[object_id] = [ss].[object_id]
                        AND [sc].[stats_id] = [ss].[stats_id]
                ORDER BY [sc].[stats_column_id]
              FOR
                XML PATH('')
              ), 1, 2, '') AS [ColumnsInStatistic] ,
        [ss].[auto_Created] AS [WasAutoCreated] ,
        [ss].[user_created] AS [WasUserCreated] ,
        [ss].[has_filter] AS [IsFiltered] ,
        [ss].[filter_definition] AS [FilterDefinition] ,
        [ss].[is_temporary] AS [IsTemporary]
FROM    [sys].[stats] [ss]
        JOIN [sys].[objects] AS [so] ON [ss].[object_id] = [so].[object_id]
        JOIN [sys].[schemas] AS [sch] ON [so].[schema_id] = [sch].[schema_id]
        LEFT OUTER JOIN [sys].[indexes] AS [si]
              ON [so].[object_id] = [si].[object_id]
                 AND [ss].[name] = [si].[name]
WHERE   [so].[object_id] = OBJECT_ID(N'tbl_CUSTOMERDTLS')
ORDER BY [ss].[user_created] ,
        [ss].[auto_created] ,
        [ss].[has_filter];
GO

In next version we are trying to discuss more about it.

Hope you like it.