Statistics importance means one thing, PERFORMANCE. Let’s review shortly how this relation occurs.
1== You execute a query
2== The SQL query is passed to the SQL Server query optimizer
3== If an execution plan is already created for this query, it’s used, if not, a new one has to be created
4== Create a new execution plan means deciding if using a index seek, index scan or table scan for a logical search or using hash, merge and nested loop for a inner join or …
5== A number of factors influence when creating a new query execution plan, one of the most important is cardinality estimations. Others are number of CPUs, amount of RAM for example
6== Data distribution statistics on tables and indexes is from where cardinality estimates are obtained. If there were no statistics the optimizer will be forced to simply scan the table or index
7== Bottom line, with good, accurate statistics, query optimizer can get accurate cardinality estimates and create a good execution plan for a query. Final result, brilliant performance J
Note: Cardinality estimations/data distribution also have a big impact on parallelism, join-order and even memory allocation for query requests
Statistics real data
A single set of statistics gathers two different measures about the data, his density and his distribution.
Density shows how many unique values there are within a given column, or set of columns.
To calculate density we can use the following formula:
Density = 1 / Number of distinct values for column(s)
High density values (low distinct values used repeatedly through the column (s) – like a true/false column) will be of less use to the optimizer because it might not be the most efficient way of getting at your data. For example, for a table with high number of records a where clause that uses a bit column will lead the optimizer to use a full table scan instead of an index seek because it’s more efficient.
Data distribution does a statistical analysis of the kind of data a column has.
The statistical analysis is represented as/generates a histogram that’s a visual representation of the distribution of the data.
The histogram defines intervals of data (bins) and then represents data distribution into those intervals. This way you get a fairly good density estimation of data quickly and using few resources.
SQL Server histogram consists of 200 bins (interals of data) represented as 200 rows in a table.
The rows represent the way the data is distributed within the column by showing a pieces of data describing that distribution:
|RANGE_HI_KEY||Upper bound value of a histogram step i.e. the highest value in the step. The first step is the lowest value for the column in the table.|
|RANGE_ROWS||Number of rows from the sample that fall within a histogram step, excluding the upper bound. By definition, this must be zero in the first sample.|
|EQ_ROWS||Number of rows from the sample that are equal in value to the upper bound of the histogram step|
|DISTINCT_RANGE_ROWS||Number of distinct values within a histogram step, excluding the upper bound.|
|AVG_RANGE_ROWS||Average number of duplicate values within a histogram step, excluding the upper bound.
This is defined as:
(RANGE_ROWS / DISTINCT_RANGE_ROWS for DISTINCT_RANGE_ROWS > 0)
How – sample or full scan
As we mention earlier statistics holds information of data distribution in tables and indexes.
Statistics data can be created of two different ways, by sampled or through a full scan.
The sampled approach makes the generation and updates of statistics very fast, but can cause them to be inaccurate.
Sample mechanism randomly reads data through the table and then does calculations to arrive at the values for the statistics. This is accurate enough for most data set, but if you prefer you can make much more accurate statistics by manually updating or creating the statistics with a full scan.
When – automatically or manually
Statistics can be created or updated automatically or manually.
You can create statistics in SQL Server 2008 in a number of different ways, as described below:
– The optimizer automatically creates single-column nonfiltered statistics as needed as a side effect of optimizing SELECT, INSERT, UPDATE, DELETE, and MERGE statements if AUTO_CREATE_STATISTICS is enabled, which is the default setting
– There are two basic statements in SQL Server 2008 that explicitly generate the statistical information described above:
+ CREATE INDEX generates the declared index in the first place, and it also creates one set of statistics for the column combinations constituting the index keys (but not other included columns)
+ CREATE STATISTICS only generates the statistics for a given column or combination of columns.
Note: If the CREATE INDEX defines a predicate, the corresponding statistics are created with the same predicate.
In addition, there are several other ways to create statistics or indexes. Ultimately, though, each issues one of the above two commands
– Use sp_createstatsto create statistics for all eligible columns (all except XML columns) for all user tables in the current database. A new statistics object will not be created for columns that already have a statistics object.
– Use dbcc dbreindex to rebuild one or more indexes for a table in the specified database.
– In SQL Server Management Studio, expand the folder under a Table object, right click the Statistics folder, and choose New Statistics.
– Use the Database Engine Tuning Advisor to create indexes.
Note: The default statement when rebuilding an index generates new statistics automatically. This can be changed with the use of a parameter. When reorganizing indexes no statistics updates are made.
Statistics are updated automatically by default, and these are the three thresholds that automatically cause an update (the next time they are needed – to be more specific this process marks them as invalidated):
– The table size has gone from 0 to >0 rows
– The number of rows in the table when the statistics were gathered was 500 or less, and the colmodctr of the leading column of the statistics object has changed by more than 500 since then
– The table had more than 500 rows when the statistics were gathered, and the colmodctr of the leading column of the statistics object has changed by more than 500 + 20% of the number of rows in the table when the statistics were gathered
The default behavior presented above has some variations with filtered indexes and filtered statistics. When applying traceflag 2371 in SQL Server 2008R2 SP1 and in SQL Server 2012 you are able to set dynamic value instead of the fixed 20%, very helpful for larger databases in order to force a more frequent statistics update.
You can also update statistics manually this way:
– Running sp_updatestats that uses a cursor to walk through all database statistics. The process based on changes to column modification counters (colmodctrs) to update the statistics. Previous SQL Server 2008 if the rowmodctr flag was > 0
– Running UPDATE STATISTICS and so you can target individual statistics by providing his name. With this process you can specify the full scan option
Where are statistics stored
Statistics are stored in database.
That’s the reason why a backup/restore preserves statistics but bear in mind that this doesn’t guarantee the creation and use of a similar execution plan. The reason is that you surely have other factors that are not exactly the same, such as cpu number, memory, SQL version…
Check existing statistics
To displays the current distribution statistics for the specified target on the specified table:
DBCC SHOW_STATISTICS ( table_name | view_name , target ) [ WITH [ NO_INFOMSGS ] < option > [ , n ] ] < option > :: = STAT_HEADER | DENSITY_VECTOR | HISTOGRAM
The output generated:
– Header: which contains meta-data about the set of statistics
– Density: Which shows the density values for the column or columns that define the set of statistics
– Histogram: The table that defines the histogram laid out above