LOADING DIMENSIONS
Some dimensions are created automatically without involving the ETL process. Those contain the operational code translated into words and have no external sources.
The whole rest extracted from outside sources , frequently multiple , require special processes before delivering it into database like:
- Data cleaning-identifying and correcting or removing inaccurate, incorrect , incomplete etc. parts of the data
- Data conforming-aligning the content of some fields in the dimension with fields that are similar and situated in other part of database.
- Data-delivering module - managing slowly changing dimensions , creating the surrogate key and loading dimensions with appropriate structure , primary keys , natural keys and descriptive attributes.
In the finally step of preparing dimensional tables all dimensions should be flat-denormalized. If they are not there is formed a snowflake dimension.
The difference between those two models is that in the snowflake model data has been grouped into multiple tables instead of one large table in the flat model so generally flat tables are less problematic and may appear directly in users interface. However sometimes snowflake tables are recommended.
Dimension table fields
The dimension tables should have at least fields listed below and contain fields used to group data during the database inquiry process.
Those are three types of fields:
- The primary key(surrogate key)-joins with the foreign key in the fact table and allows to connect those two tables to ensure the integrity. It has no business meaning and used to maintain a hierarchy.
- The natural key(domain key) – a descriptor of the data. It is based on attributes that exist. The relationship of the surrogate key and natural key may be one o one or many to one(in slowly changing dimension)
- Descriptive attributes (textual or numeric)
Different types of dimensions
Now we will look closer at the different types of dimensions.
Date and time dimension
It is created in the very beginning of data warehouse project and has usually a form of calendar with the granularity of a single day. It contains data types attributes , full data descriptions attributes and some additional records(at least one) for inapplicable , incorrect data or data that hasn’t appeared yet.
Time dimension is a kind of data dimension but limited in time like: month or year dimension table. It is necessary if we have fact tables subordinated to a specified period of time.
Those two types of dimensions should be created separately!
Big dimensions
When we talk about big dimensions we mean such as a customer or product dimension. Loading it may be quite terrifying because of a big number of records , a big number of fields in every record and their multiple sources( it is placed in many various places in enterprise system).
In this situation there is a need to perform three steps to create a single dimension table:
- deduplication step which is a part of data cleaning-module
- conforming step which is a part of data-conforming module
- merge step which is a part of delivering-module
Taking under consideration the customer dimension we have to remove all duplications to create a correct total number of customers , align all attributes with the same content to create a single one value and finally place all attributes in one big dimension record.
Small dimensions
Have a form of little tables with only a few records and columns like a little transaction-type dimension with labels of each type or junk dimension with transactional codes. Those dimensions are usually unique for a one fact table and should not be built across the various facts tables.
Related dimensions
The dimensions in the database should be, through the ideology, independent. However there is usually a relationship between them.
If this relationship occurs between big dimensions like product and store the combining them into a one is not recommended , so we make them separately depicting these relationship in the sale fact table. This step of dimensional modeling is called demoting the correlation between dimensions into a fact table.
However if this relationship concerns overlapping dimensions the combining them into a one single dimension is then very recommended.
Dimensional role playing
This situation occurs when a single dimension , often a time dimension, is attached multiple times to the one fact table. For example in an accumulating transaction fact table there are many foreign keys for time dimensions like: order data , deliver date , payment date.
How to resolve this problem? We need to built a generic single dimension table and implement those three views.
Sub dimensions
Sometimes there is a relationship between two dimensions. Then one is a sub-dimension of the another. For example we have a data dimension table and customer dimension table which has a foreign key in first purchase date related to the data dimension.
Empty dimensions
Degenerate or empty dimension is a dimension derived from a fact table and it doesn’t have its own dimension table (It is stored in a fact table).For example if we have a dimension with only one record, it is unnecessary and very inconvenient to create it separately. Then we should locate it in a fact table as a degenerate dimension without the foreign key.
Slowly changing dimension (SCD)
If the data in the dimension tables is slowly changing during the time we say that it is a slowly changing dimension.
There are three major ways of loading and handling these dimensions:
- Type one-overwrites old data with a new data and doesn’t keep the history. It is very often used when we want to correct inappropriate data like wrong address or misspelled name.
- Type two-tracks the history and create new records in the dimensional table with separate keys. It is unlimited (creates new records each time a change is made) and the most common.
- Type three-also tracks a history but create new columns , not records. It is limited by number of columns we design.
Hybrid type-it is a combination of those three mentioned types. For example some fields in the dimension present the type 1 and then there are overwritten while other may present type 2 or 3.
In slowly changing dimension the relationship between the primary (surrogate)key and natural key is one to many
However sometimes those changes may be overlooked and we have to correct bad data. Correcting a type 1 fields is very simple-just overwriting but others may be quite difficult.
Bridge table
The bridge table is also called a helper table or reference table. It is used when there is a many to many relationship between fact and dimension table and it contains only the keys. It can be explained with a following example.
Lets say there are two tables: an employee table and a department table and we create a bridge table. It will contain information which employee belongs to which department. It makes our database more clearly.
The bridge table is also used for another purpose-to flatten out the hierarchy in the dimension table.
The hierarchy
The loading process generates very ragged and unbalanced hierarchy in the database. There are two ways to store hierarchies: by using recursive pointer or bridge table.
In the recursive pointer approach each record has a primary key and a foreign key to connect with another record in the same table. It embeds the hierarchy in the dimension table while in the bridge table the relationship is located in it and the dimension table is left untouched.
In the recursive pointer it is much easier to add and move some portions of the hierarchy while in the bridge table it is more complex. Each type of approach has some advantages and disadvantages. Which one will you choose it depends on you.