Staging, Statistics & Common Sense
Oracle Statistics Maintenance Strategy in an ETL environment
Statistics Maintenance strategies are a pretty dry subject, but pretty important to guarantee SLA’s and consistent performance in an Oracle environment. This paper presents a particular strategy for dealing with statistics in a Staging area where joins are being used across the staging tables which force the need for a good level of statistics to be in place.
Most ETL applications use a staging area to stage source system data before loading it into the warehouse or marts. When implemented within an oracle environment a partitioning strategy is usually employed such that data that is not required any longer can be removed from the tables with theminimum amount of effort.
However, what sort of statistics maintenance strategy have your DBA’s or ETL Architects implemented for the staging area? Have they left everything defaulted and use Oracle’s GATHER STALE option? In a large proportion of the sites I visit, this is the strategy that is most widely deployed. To be fair, in the majority of cases for the majority of the time this can work perfectly well – but there are always those cases where people get called in the middle of the night because the ETL process is “running slow”. On the client sites I work on, I usually recommend a static approach to statistics in the staging area where possible. That is, you baseline your statistics up-front, and then leave them alone.
Note that this solution is not the only solution available. Other solutions include partitioning strategies which directly match your loading strategy or dynamic sampling (although this can also cause issues). It would be great to hear how other people handle statistics maintenance in their staging environment (ifat all).
There are usually three distinct type of tables within the staging schema –
Partitioned, time series tables – these tables make up the majority of tables in a typical staging area. They are all partitioned on a date range and will all be interrogated by a date range predicate.
Non-partitioned, time series tables – usually a very small minority. Usually fall into the category of “unable to partition” due to some data specific reasons.
Non-partitioned, non-time series tables – mainly reference type tables which are used as reference data. Not usually queried by a predicate, but taken in totality for each load if required.
The staging tables usually get populated by some outside source, by either pulling or pushing the data from the source systems. This process is usually an insert only process and therefore does not rely on statistics for it’s successful execution.
With a monthly partitioning strategy used in the staging schema and for a daily batch load, partition level statistics will always be used as the query will never pull out more than a single days worth of data.
The biggest question for the staging area is – how do we keep the statistics up-to-date such that the statistics for a particular daily load are always available and reasonably accurate. This is actually more difficult than it sounds. If we were to use the generic “gather stale” option, the partitions would only be analyzed in the first quarter of the month each night, going to every other night and eventually each week because of the 10% stale setting. This obviously leaves us with a problem. Also, when are the statistics to be gathered? In order to have the statistics available for the latest day which is loaded, the statistics would have to be gathered after the staging tables have been loaded but before the ETL process starts.
For example –
Day 24 of the month, staging load has just completed and the statistics have just been collected –
1 explain plan for
2 select * from stin_finance_transactions 3 where last_modified_dt >= to_date(’24-Mar-2005 00:00:00′,’DD-MON-YYYY HH24:MI:SS’) 4*
/ Explained. DW_STAGE@>/ PLAN_TABLE_OUTPUT ———————————————————————————-
Plan hash value: 2829342442 ———————————————————————————–
| Id | Operation | Name | Rows | ———————————————————————————–
| 0 | SELECT STATEMENT | | 1205K| | 1 | PARTITION RANGE SINGLE | | 1205K|
| 2 | TABLE ACCESS BY LOCAL INDEX ROWID| STIN_FINANCE_TRANSACTIONS | 1205K|
|* 3 | INDEX RANGE SCAN | UI_STIN_FINANCE_TRANSACTIONS | 1205K
Predicate Information (identified by operation id):
3 – access(“LAST_MODIFIED_DT”>=TO_DATE(‘2005-03-24 00:00:00’, ‘yyyy-mm-dd hh24:mi:ss’) AND ”LAST_MODIFIED_DT”
16 rows selected.
Effectively this means there are no maintenance requirements for the DW_STAGE statistics in the short / medium term. The only maintenance required will be when new partitions are added to the staging tables – which will simply require the application of the current baselines (handled by the supplied package). The possible negative implications of this approach is that the DW_STAGE schema will not support adhoc end user queries.
However, I personally believe this could be viewed as almost positive – yet another reason not to allow end-users access to complicated “partly relational” datasets and therefore reducing the possibilities of people basing business decisions on poor quality intelligence.
@Copyright 2012 Assertive Software Ltd All rights reserved