DATA WAREHOUSING FUNDAMENTALS
Data warehousing is a Techno functional module. It bridges gap between Technical and Functional activities
Any database has two types of systems:
1) OLTP (Online Transaction Processing) system
2) OLAP (Online Analytical Processing) system
DWH is a database containing the collection of current and historical data in a single platform drawn from different OLTP systems so that this data can be used for effective reporting and analysis
OLTP
* Mainly used for Transaction purpose
* INSERT, UPDATE and DELETE operations are implemented
* Data is volatile
* Normalized data
* Contains current data (~ 1 yr)
OLAP
* Mainly used for Reporting and Analysis
* Only SELECT operation is used for reporting and analysis
* Non-volatile data
* De-normalized data
* Contains historical data (~ 5 to 25 yrs)
Traditional Definitions of DWH
• DWH is a copy of the transactional data specially structured for reporting and analysis
• DWH is a Subject oriented, Integrated, Non-volatile and Time-variant data in order to support the effective decision making of the enterprise
OLTP supports two types of data
1) Master Data is the data that is created only once, but changed occasionally. e.g., Emp#. An emp joins in a company only once, but his salary, designation,… may change occasionally
2) Transactional Data is generated due to the business activities. e.g., Price, Quantity. Purchasing a product is a transaction
Granularity in DWH
Granularity specifies the level of detail with in DWH to be represented. It is specified by 3 major factors:
1) Existing level of detail in the OLTP systems
2) Requirements of the Business users
3) Storage capacity of DWH
Ageing/Backup Process
As the time passes by; Current data becomes Historical
and Historical becomes Most Historical
So it is not a good idea to keep the most historical data in the online storage which is less frequently used
Better is to move the data to offline in order to accommodate incoming data from the OLTP systems
Such a process is called “Ageing Process”
Architecture of a DWH
Data warehousing architecture includes the following components:
1) Source Systems
2) Staging Layer
3) DWH
4) Data marts
5) Reports
6) Business users
Objectives of DWH
1) Technical Objective of constructing a DWH is to separate the transactions from reporting & analysis
** If transactions and reporting are performed on same system a Deadlock might occur, which may lead to ineffective reporting **
2) The business users should be self capable of creating their queries to generate the reports
The technical details of the database can be hidden from the business users
Once the reports are produced, the business users can analyze the data multi-dimensionally
Phases of DWH implementations
1) Modeling the DWH
2) Migrating the data from OLTP systems to DWH with the help of ETL Process
3) Utilizing the DWH for producing 2-dimensional and multi-dimensional reports
Multi-Dimensional Modeling
* A model acts as a blue print for the entire DWH
* Data modeling is the process of converting the business requirements into the technical specifications of the database
* Multi-Dimensional modeling is the principle that is used to model data marts and DWH
* Multi-Dimensional modeling is based on the concept called Star-Schema
STAR SCHEMA
Star Schema consists of two types of tables
1) Dimensional Tables
2) Fact Tables
* Dimensional tables contain the Master data from OLTP systems, also called as Dimensions
* Fact tables contain the Transactional data from OLTP systems, also called as Measures
* Fact table in a Star Schema contains two sections:
* i) Key Section
* ii) Facts/Measures Section
All the Measures that are generated due to business activities in the OLTP systems should be recorded in the Fact section of the Fact table
A Star Schema contains multiple Dimensional tables and a single Fact table
** If a schema contains multiple Fact tables, it is called as “Galaxy” or “Congestion” **
* Among the tables of the Star schema, Dimensional tables are independent and Fact tables are dependent in nature
* To have the relationship between dimensional and fact tables, the primary keys of dimensional tables should be migrated/transferred as foreign keys to the keys section of the fact table
* The Keys section of the fact table is a Composite Primary key which is a collection of different foreign keys from different dimensional tables.
Types of Star Schema’s
OLAP supports two types of Star schema’s:
1) Star Flake Star Schema
2) Snow Flake Star Schema
STAR FLAKE
* Dimensional table’s are De-Normalized
* Fact table’s are Normalized
* Effective for reporting
SNOW FLAKE
* Dimensional table’s are Normalized
* Fact table’s are Normalized
*
|
Effective for Transactions
Initial and Delta Extracts
* Data extracted from OLTP systems into OLAP systems for the first time is treated as Initial Extract
* Delta Extract specifies the changes
* Changes are in the form of Inserts, Updates and Deletes
* In the case of Delta Extract, only the records that has changed after the previous extract should be extracted into OLAP
* If the DWH maintains only the current data, Inserts of the OLTP systems are treated as Inserts and Updates are treated as Updates -- This is treated as Type I change
* If OLAP systems maintain both current and historical data, Inserts are treated as Inserts and Updates are treated as Inserts – This is treated as Type II change
* The Primary Keys of the OLTP systems should not be used as the Primary keys in DWH. It generates its own keys called asWarehouse Keys or Surrogate Keys for the purpose of internally identifying the records
* In the above example EmpNo.’s 102 & 103 becomes duplicate. Hence WHEmpNo is kept as PK
Types of Dimension Tables
Master Data loaded into the Dimension tables is changed occasionally. These Dimension tables are treated as Slowly Changing Dimension tables
1) SCD Type 1 -- Maintains Current Data
I – I
U – U
|
2) SCD Type 2 -- Maintains Current Data & Full History
i) Flag Current Data
I – I
U – U
U – I
* In Flag type, the previous version of the Flag should be changed from New to Old
* The demerits of this type is, we cannot know the exact older version record
* From above scenario, we cannot say which is the older among 3002 & 3008. In this situations we go for Version type
* ii) Version Number
* I – I
* U – I
* In this type all the updates are treated as Inserts
* This type can identify the old version records
*
iii) Effective Date Range
I – I
U – U
U – I
* By this type, we can track changes w.r.t Date
3) SCD Type 3 – Maintains Current Data & one time History
I – I
U – U
Types of Fact tables
1) Transaction Grain Fact table
2) Periodic Snapshot Fact table
3) Accumulating Snapshot Fact table
4) Fact less Fact table
1) Transaction Grain Fact table
* The Grain specifies the level of details, transactions can be recorded
* The Transaction Grain Fact table in OLAP system maintains the same level of details in OLTP system
* For this type of tables, the data should be extracted from OLTP to OLAP on daily basis
* Since the data is present at the lowest possible level of detail; Analysis can be done to the lowest possible extent
* Operational Data Source (ODS) is the best example of this type
2) Periodic Snapshot Fact table
* The data is loaded into the OLAP tables periodically
* The user can have the correct estimate of the number of records that can be loaded
* When ever the data is loaded, the data for that particular period is summarized and loaded as a single record
* These type of tables regenerate the current data
* The records are only inserted into these tables; but not updated
* Telephone bill database is the best example for this type of tables
3) Accumulating Snapshot Fact table
* All the values of the records of the Accumulating snapshot fact tables are not known in advance. We need to wait for the time to pass in order to make the events to occur to realize these particular values
* These contain date columns
* Data has to be loaded daily
* Data in these tables represent the current data
* Shipping Products is an example of Accumulating Fact table
4) Fact less Fact table
* Some times regular star schema cannot provide the possible analysis. Such cases need accompanying star schema where the fact table will have only the key section but not measures; such type of tables are called Fact less Fact tables (Coverage tables)
* Log-in and Log-out times of an employee doesn’t require any measures. This is an example of Fact less Fact table
Types of OLAP
1) ROLAP
2) MOLAP
3) HOLAP
4) DOLAP
* Reports can be created from 2-dimensional or multi-dimensional structures
* 2-D reports are created from tables
* Multidimensional reports are created from cubes
* ROLAP (Relational OLAP)
* The dimension tables and the fact tables in Relational OLAP exists as relational tables
* Since all the data will be present in the relational database, the current data as well as historical data can participate in reporting process
* Drilling the data can be done to the detailed level
* Reporting is slow with ROLAP
* 2) MOLAP (Multi-dimensional OLAP)
* In MOLAP, the data is physically stored in multi-dimensional databases called Cubes
* Since the data is directly existing in multidimensional database, reporting is extremely fast
* Data in multidimensional data marts is first migrated from OLTP systems to relational DWH and from relational DWH to multidimensional data marts
* Current data will not participate in the reporting process with MOLAP
* Drilling the data cannot be done to the detailed level
* 3) HOLAP (Hybrid OLAP)
* HOLAP is the combination of ROLAP and MOLAP
* With HOLAP drilling the data can be done to the detailed level with fast reporting
* 4) DOLAP (Desktop OLAP)
* When ever the executives present the information at the time of seminars; critical information is stored in the desktop databases like MS-Access, Excel sheets and XML files. Such type of structures are called DLOP structures
Types of Dimensions
Dimension
A dimension table typically has two types of columns, primary keys to fact tables and textual\descriptive data.
Eg: Time, Customer
Types of Dimensions
- Slowly Changing Dimensions
- Rapidly Changing Dimensions
- Junk Dimensions
- Inferred Dimensions
- Conformed Dimensions
- Degenerate Dimensions
- Role Playing Dimensions
- Shrunken Dimensions
- Static Dimensions
Slowly Changing Dimensions
Attributes of a dimension that would undergo changes over time. It depends on the business requirement whether particular attribute history of changes should be preserved in the data warehouse. This is called a slowly changing attribute and a dimension containing such an attribute is called a slowly changing dimension.
Rapidly Changing Dimensions
A dimension attribute that changes frequently is a rapidly changing attribute. If you don’t need to track the changes, the rapidly changing attribute is no problem, but if you do need to track the changes, using a standard slowly changing dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a rapidly changing dimension.
Junk Dimensions
A junk dimension is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by rapidly changing dimensions.
Inferred Dimensions
While loading fact records, a dimension record may not yet be ready. One solution is to generate a surrogate key with null for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.
Conformed Dimensions
A dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses.
Degenerate Dimensions
A degenerate dimension is when the dimension attribute is stored as part of fact table, and not in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a data warehouse, these are often used as the result of a drill through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.
Role Playing Dimensions
A role-playing dimension is one where the same dimension key — along with its associated attributes — can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign keys for both ship date and delivery date. But the same date dimension attributes apply to each foreign key, so you can join the same dimension table to both foreign keys. Here the date dimension is taking multiple roles to map ship date as well as delivery date, and hence the name of role playing dimension.
Shrunken Dimensions
A shrunken dimension is a subset of another dimension. For example, the orders fact table may include a foreign key for product, but the target fact table may include a foreign key only for productcategory, which is in the product table, but much less granular. Creating a smaller dimension table, with productcategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the product dimension is snowflaked, there is probably already a separate table for productcategory, which can serve as the shrunken dimension.
Static Dimensions
Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with status codes — or it can be generated by a procedure, such as a date or time dimension.
Data Warehousing
Lot of my friends and reader asking me to write a tutorial related to Microsoft BI tools. As I personally feel that the Data Warehousing is not just understand or practice via some Tools provided by Microsoft, it need deep understanding analysing with data. Well we can learn the tools very easily but sensing the data and information is quite tough to learn. It’s growing with maturity and hard work.
Well if readers want me to write something, here I am trying to give them something by my article.
In this article I am trying to understand the concept behind data ware housing. Why we all think about it.
What is the Data Warehousing?
One of the main features of data warehousing is to combining data from heterogeneous data sources into one comprehensive and easily maintained database.
The common accessing systems of data warehousing includes
Queries
Analysis
Reporting
Queries
Analysis
Reporting
As the number of source can be anything, the data warehouse creates one database at the end. The final result however, is homogeneous data, which can be more easily manipulated.
Data warehousing is commonly used by companies to analyse trends over time. Its primary function is facilitating strategic planning resulting from long-term data overviews. From such overviews, business models, forecasts, and other reports and projections can be made. Routinely, because the data stored in data warehouses is intended to provide more overview-like reporting, the data is read-only. If you want to update the data stored via data warehousing, you'll need to build a new QUERY when you're done.
We are not saying that data warehousing involves data that is never updated. On the contrary, the data stored in data warehouses is updated all the time. It's the reporting and the analysis that take more of a long-term view.
Data warehousing is not the be-all and end-all for storing all of a company's data. Rather, data warehousing is used to house the necessary data for specific analysis. More comprehensive data requires different capacities that are more static and less easily manipulated than those used for data warehousing.
Data warehousing is typically used by larger companies analysing larger sets of data for enterprise purposes.
Smaller companies wishing to analyse just one subject, for example, usually access data marts, which are much more specific and targeted in their storage and reporting. Data warehousing often includes smaller amounts of data grouped into data marts. In this way, a larger company might have at its disposal both data warehousing and data marts, allowing users to choose the source and functionality depending on current needs.
Hope you like it. In my next session I am directly jump over Microsoft BI tools Introduction and try to discuss when you used them.
Hope you like it.