Slowly Changing Dimensions (SCD)
Slowly Changing Dimensions (SCD) is a crucial concept in dimension modeling, particularly in the context of data warehousing and business intelligence. SCD deals with handling changes to dimension attributes over time, ensuring that historical data remains accurate and available for analysis. There are typically three common types of Slowly Changing Dimensions:
- Type 0: retain original
The Type 0 dimension attributes never change and are assigned to attributes with durable values or are described as ‘Original’. Examples: Date of Birth, Original Credit Score. Type 0 applies to most date dimension attributes
2. Type 1: No History (Overwrite):
- In this approach, when a dimension attribute changes, the old value is overwritten with the new value.
- Historical data is not preserved; the dimension reflects only the most recent state.
- This method is suitable when historical data is not important for analysis or storage efficiency is a primary concern.
Let’s suppose that Adam moves from New York to New Jersey.
So we don’t have any information about the previous city.
3. Type 2: Add New Row (Historical):
- In the Type 2 SCD approach, when a dimension attribute changes, a new row is added to the dimension table to capture the new attribute value.
- This preserves the historical state of the dimension, allowing for historical analysis.
- Each row typically has an effective date range to indicate when it was valid.
- A surrogate key is often used to relate fact tables to the dimension tables.
- Type 2 is appropriate when historical data is important for analysis and reporting.
- Example: (Using a flag) In this approach, we use a flag to indicate if a particular value is active or if it is current.
To get the latest update on all customers or a given customer, we just need to query isActive=True
. Every time there is a change, we need to add a new row and update the isActive
column of the previous rows False
and the newest to True
.
4. Type 3: Add Columns (Partial Historical):
- Type 3 SCD involves adding new columns to the dimension table to store limited historical data. Typically, two columns are added: one for the current value and one for the previous value.
- This approach allows for limited historical analysis by comparing the current and previous values.
- It’s useful when full historical data is unnecessary, and there is a desire to keep historical changes in the same row.
5. Type 4: Hybrid Slowly Changing Dimension (SCD):
- Also known as “Type 4” or “SCD Type 4,” this hybrid approach combines Type 2 and Type 1 elements.
- This method creates a separate historical table to track changes over time while the main dimension table remains static.
- The historical table captures the history of changes, allowing for historical analysis.
- It can be more space-efficient than a pure Type 2 approach since the main dimension table remains relatively small.
- Example: In the following diagram, if we assume that the carpool (also known as High occupancy vehicles) pass needs to be purchased every month, we can move that field to a smaller mini-dimension and reference it directly from the fact table:
we can split the table into a mini DimCarPool Dimension.
6. Type 6: Type 1 + Type 2 (Combination SCD):
- Type 6 SCD is a combination of Type 1 and Type 2 approaches.
- In this method, you maintain a current value in the main dimension table (Type 1) while creating new rows for historical changes (Type 2).
- This provides the most recent data and a history of changes, but it can be more complex to manage than other methods.
- Example:
The choice of which SCD type to use depends on the specific needs of your organization and the nature of the data you are working with. It’s essential to consider factors like data update frequency, the importance of historical data, query performance, and data storage constraints when deciding on the appropriate SCD method for your dimension modeling. Type 1 and Type 2 are the most commonly used SCD types, but the others may be valuable in specific scenarios.
Choosing the appropriate SCD method depends on the specific requirements of the organization and the nature of the data. Type 2 SCD is the most common choice for data warehousing and analytics, as it balances historical preservation and simplicity. However, Type 1 and Type 3 can be helpful in specific scenarios.
SCD is essential because it ensures data accuracy in data warehousing and analytics environments, enabling organizations to analyze historical trends, track changes, and make informed decisions based on historical data. The choice of SCD method should align with the business needs and the dimension modeling strategy in place.
Reference:
- https://github.com/arunp77/Database-datapipeline-ETL/tree/main/Database
- For more details, please visit my website.