Introduction to Slowly Changing Dimensions (SCD)
A dimension is a structure that categorizes a collection of information so that meaningful answers to questions regarding that information may be obtained. Dimensions in data management and data warehouses contain relatively static data; however, this dimensional data can change slowly over time and at unpredictable intervals. These types of dimensional data are known as Slowly Changing Dimensions (SCD).
In a data warehouse environment, a dimension table has a primary key that uniquely identifies each record and other pieces of information that are known as the dimensional data. The primary key will link to a fact table using this key. Therefore, it is important to have a model for updating and handling the current data and the historical data. There are recognized models that are known by their types:
- Type 1 – This model involves overwriting the old current value with the new current value. No history is maintained.
- Type 2 – The current and the historical records are kept and maintained in the same file or table.
- Type 3 – The current data and historical data are kept in the same record. The user decides how much history is kept in the record.
- Type 4 – In this model, the current data is maintained in two different tables; one for the current data and one that contains all the historical data.
- Type 6 – This model is a hybrid of Type 1, Type 2, and Type3.
All the update methods for the different SCD models can be accomplished using the SortCL program in IRI CoSort or IRI Voracity. Voracity now also provides a single end-to-end job creation wizard to build these programs automatically; see below. Most types use a full outer join to match records from the master data source with records in the update source based on equating a unique key from each. Records with matches need to be updated in the master. Records in the update source that do not have a match need to be added to the master.
My articles on the different SCD types show the source or input data in files and shows the SortCL scripts that are used to update the master files. Tables can also be a source with these scripts.
In each example, I cover the steps to update a dimensional source that maintains product costs that change at irregular intervals. The key field or column is ProductCode. All source master files will contain the following data:
The master1.dat source contains:
ProductCode | Cost | StartDate |
---|---|---|
C123 | 125.50 | 20110228 |
F112 | 2365.00 | 20120101 |
G101 | 19.25 | 20110930 |
J245 | 450.50 | 20110430 |
S022 | 98.75 | 20110515 |
The update data will all have the same StartDate. The update source contains the following data:
ProductCode | Cost | StartDate |
---|---|---|
F112 | 2425.00 | 20120701 |
J245 | 550.50 | 20120701 |
M447 | 101.75 | 20120701 |
S022 | 101.75 | 20120701 |
SCD reporting is also discussed on the IRI web site here. See these articles specific to SCDs: Type 1, Type 2, Type 3, Type 4, and Type 6.
New SCD Graphical Wizard
Voracity users can now also use a purpose-built job creation wizard to report on SCD types 1, 2, 3, 4, and 6 in the IRI Workbench GUI, built on Eclipse. The wizard also supports the integration of sorting, expression evaluation, aggregation, new formatting, encrypting, and more.
After setting up the job and selecting the SCD type in the wizard, you specify the master data and update information. After that, you specify the mappings for the target, placing the fields in line with the matched source fields, and using the combo boxes as needed, work with current and historical field sets. After joining the specified information on the Join Sources page, the appropriate script is created and ready to diagram, execute, modify, or share.
The articles associated with the types above will be updated to reflect how the new SCD wizard in Voracity can auto-create these jobs for you.
1 COMMENT
[…] http://www.iri.com/blog/vldb-operations/introduction-to-slowly-changing-dimensions-scd/ […]