Tuesday, September 1, 2015

DWH Fundamentals


DATA WAREHOUSING FUNDAMENTALS

Data warehousing is a Techno functional module. It bridges gap between Technical and Functional activities

Any database has two types of systems:

1)      OLTP (Online Transaction Processing) system

2)      OLAP (Online Analytical Processing) system

DWH is a database containing the collection of current and historical data in a single platform drawn from different OLTP systems so that this data can be used for effective reporting and analysis

OLTP

*         Mainly used for Transaction purpose

*         INSERT, UPDATE and DELETE operations are implemented

*         Data is volatile

*         Normalized data

*         Contains current data (~ 1 yr)

OLAP

*         Mainly used for Reporting and Analysis

*         Only SELECT operation is used for reporting and analysis

*         Non-volatile data

*         De-normalized data

*         Contains historical data (~ 5 to 25 yrs)

Traditional Definitions of DWH

       DWH is a copy of the transactional data specially structured for reporting and analysis

       DWH is a Subject oriented, Integrated, Non-volatile and Time-variant data in order to support the effective decision making of the enterprise

OLTP supports two types of data

1)      Master Data is the data that is created only once, but changed occasionally. e.g., Emp#. An emp joins in a company only once, but his salary, designation,… may change occasionally

2)      Transactional Data is generated due to the business activities. e.g., Price, Quantity. Purchasing a product is a transaction

Granularity in DWH

Granularity specifies the level of detail with in DWH to be represented. It is specified by 3 major factors:

1)      Existing level of detail in the OLTP systems

2)      Requirements of the Business users

3)      Storage capacity of DWH

Ageing/Backup Process

As the time passes by; Current data becomes Historical

and Historical becomes Most Historical

So it is not a good idea to keep the most historical data in the online storage which is less frequently used

Better is to move the data to offline in order to accommodate incoming data from the OLTP systems

Such a process is called “Ageing Process”

Architecture of a DWH

Data warehousing architecture includes the following components:

1)      Source Systems

2)      Staging Layer

3)      DWH

4)      Data marts

5)      Reports

6)      Business users

Objectives of DWH

1)      Technical Objective of constructing a DWH is to separate the transactions from reporting & analysis

** If transactions and reporting are performed on same system a Deadlock might occur, which may lead to ineffective reporting **

2) The business users should be self capable of creating their queries to generate the reports

The technical details of the database can be hidden from the business users

Once the reports are produced, the business users can analyze the data multi-dimensionally

Phases of DWH implementations

1)      Modeling the DWH

2)      Migrating the data from OLTP systems to DWH with the help of ETL Process

3)      Utilizing the DWH for producing 2-dimensional and multi-dimensional reports

Multi-Dimensional Modeling

*         A model acts as a blue print for the entire DWH

*         Data modeling is the process of converting the business requirements into the technical specifications of the database

*         Multi-Dimensional modeling is the principle that is used to model data marts and DWH

*         Multi-Dimensional modeling is based on the concept called Star-Schema

STAR SCHEMA

Star Schema consists of two types of tables

1)      Dimensional Tables

2)      Fact Tables

*         Dimensional tables contain the Master data from OLTP systems, also called as Dimensions

*         Fact tables contain the Transactional data from OLTP systems, also called as Measures

*         Fact table in a Star Schema contains two sections:

*         i) Key Section

*         ii) Facts/Measures Section

All the Measures that are generated due to business activities in the OLTP systems should be recorded in the Fact section of the Fact table

A Star Schema contains multiple Dimensional tables and a single Fact table

** If a schema contains multiple Fact tables, it is called as “Galaxy” or “Congestion” **



*         Among the tables of the Star schema, Dimensional tables are independent and Fact tables are dependent in nature

*         To have the relationship between dimensional and fact tables, the primary keys of dimensional tables should be migrated/transferred as foreign keys to the keys section of the fact table

*         The Keys section of the fact table is a Composite Primary key which is a collection of different foreign keys from different dimensional tables.

Types of Star Schema’s

OLAP supports two types of Star schema’s:

1)      Star Flake Star Schema

2)      Snow Flake Star Schema

STAR FLAKE

*         Dimensional table’s are De-Normalized

*         Fact table’s are Normalized

*         Effective for reporting



SNOW FLAKE

*         Dimensional table’s are Normalized

*         Fact table’s are Normalized

*        





Initial and Delta Extracts
 



Effective for Transactions



Initial and Delta Extracts

*         Data extracted from OLTP systems into OLAP systems for the first time is treated as Initial Extract

*         Delta Extract specifies the changes

*         Changes are in the form of Inserts, Updates and Deletes

*         In the case of Delta Extract, only the records that has changed after the previous extract should be extracted into OLAP



*         If the DWH maintains only the current data, Inserts of the OLTP systems are treated as Inserts and Updates are treated as Updates  -- This is treated as Type I change

*         If OLAP systems maintain both current and historical data, Inserts are treated as Inserts and Updates are treated as Inserts – This is treated as Type II change


*         The Primary Keys of the OLTP systems should not be used as the Primary keys in DWH. It generates its own keys called asWarehouse Keys or Surrogate Keys for the purpose of internally identifying the records

*         In the above example EmpNo.’s 102 & 103 becomes duplicate. Hence WHEmpNo is kept as PK

Types of Dimension Tables

Master Data loaded into the Dimension tables is changed occasionally. These Dimension tables are treated as Slowly Changing Dimension tables

1) SCD Type 1 -- Maintains Current Data

                I – I

                U – U




*   Data extracted from OLTP systems into OLAP systems for the first time is treated as Initial Extract

*   Delta Extract specifies the changes

*   Changes are in the form of Inserts, Updates and Deletes

*   In the case of Delta Extract, only the records that has changed after the previous extract should be extracted into OLAP

 




2) SCD Type 2 -- Maintains Current Data & Full History

i) Flag Current Data

                I – I

                U – U

                U – I      


*         In Flag type, the previous version of the Flag should be changed from New to Old

*         The demerits of this type is, we cannot know the exact older version record

*         From above scenario, we cannot say which is the older among 3002 & 3008. In this situations we go for Version type

*         ii) Version Number

*                         I – I

*                         U – I

*         In this type all the updates are treated as Inserts

*         This type can identify the old version records

*                        


iii) Effective Date Range

                                I – I

                                U – U

                                U – I

*         By this type, we can track changes w.r.t Date


3) SCD Type 3 – Maintains Current Data & one time History

                I – I

                U – U


Types of Fact tables

1)      Transaction Grain Fact table

2)      Periodic Snapshot Fact table

3)      Accumulating Snapshot Fact table

4)      Fact less Fact table

1) Transaction Grain Fact table

*         The Grain specifies the level of details, transactions can be recorded

*         The Transaction Grain Fact table in OLAP system maintains the same level of details in OLTP system

*         For this type of tables, the data should be extracted from OLTP to OLAP on daily basis

*         Since the data is present at the lowest possible level of detail; Analysis can be done to the lowest possible extent

*         Operational Data Source (ODS) is the best example of this type

2) Periodic Snapshot Fact table

*         The data is loaded into the OLAP tables periodically

*         The user can have the correct estimate of the number of records that can be loaded

*         When ever the data is loaded, the data for that particular period is summarized and loaded as a single record

*         These type of tables regenerate the current data

*         The records are only inserted into these tables; but not updated

*         Telephone bill database is the best example for this type of tables

3) Accumulating Snapshot Fact table

*         All the values of the records of the Accumulating snapshot fact tables are not known in advance. We need to wait for the time to pass in order to make the events to occur to realize these particular values

*         These contain date columns

*         Data has to be loaded daily

*         Data in these tables represent the current data

*         Shipping Products is an example of Accumulating Fact table

4) Fact less Fact table

*         Some times regular star schema cannot provide the possible analysis. Such cases need accompanying star schema where the fact table will have only the key section but not measures; such type of tables are called Fact less Fact tables (Coverage tables)

*         Log-in and Log-out times of an employee doesn’t require any measures. This is an example of Fact less Fact table

Types of OLAP

1)      ROLAP

2)      MOLAP

3)      HOLAP

4)      DOLAP

*         Reports can be created from 2-dimensional or multi-dimensional structures

*         2-D reports are created from tables

*         Multidimensional reports are created from cubes

*         ROLAP (Relational OLAP)

*         The dimension tables and the fact tables in Relational OLAP exists as relational tables

*         Since all the data will be present in the relational database, the current data as well as historical data can participate in reporting process

*         Drilling the data can be done to the detailed level

*         Reporting is slow with ROLAP

*         2) MOLAP (Multi-dimensional OLAP)

*         In MOLAP, the data is physically stored in multi-dimensional databases called Cubes

*         Since the data is directly existing in multidimensional database, reporting is extremely fast

*         Data in multidimensional data marts is first migrated from OLTP systems to relational DWH and from relational DWH to multidimensional data marts

*         Current data will not participate in the reporting process with MOLAP

*         Drilling the data cannot be done to the detailed level

*         3) HOLAP (Hybrid OLAP)

*         HOLAP is the combination of ROLAP and MOLAP

*         With HOLAP drilling the data can be done to the detailed level with fast reporting

*         4) DOLAP (Desktop OLAP)

*         When ever the executives present the information at the time of seminars; critical information is stored in the desktop databases like MS-Access, Excel sheets and XML files. Such type of structures are called DLOP structures





Types of Dimensions

Dimension
A dimension table typically has two types of columns, primary keys to fact tables and textual\descriptive data.
Eg: Time, Customer
Types of Dimensions
  1. Slowly Changing Dimensions
  2. Rapidly Changing Dimensions
  3. Junk Dimensions
  4. Inferred Dimensions
  5. Conformed Dimensions
  6. Degenerate Dimensions
  7. Role Playing Dimensions
  8. Shrunken Dimensions
  9. Static Dimensions 
Slowly Changing Dimensions
Attributes of a dimension that would undergo changes over time. It depends on the business requirement whether particular attribute history of changes should be preserved in the data warehouse. This is called a slowly changing attribute and a dimension containing such an attribute is called a slowly changing dimension.
Rapidly Changing Dimensions
A dimension attribute that changes frequently is a rapidly changing attribute. If you don’t need to track the changes, the rapidly changing attribute is no problem, but if you do need to track the changes, using a standard slowly changing dimension technique can result in a huge inflation of the size of the dimension. One solution is to move the attribute to its own dimension, with a separate foreign key in the fact table. This new dimension is called a rapidly changing dimension.
Junk Dimensions
A junk dimension is a single table with a combination of different and unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk dimensions are often created to manage the foreign keys created by rapidly changing dimensions.
Inferred Dimensions
While loading fact records, a dimension record may not yet be ready. One solution is to generate a surrogate key with null for all the other attributes. This should technically be called an inferred member, but is often called an inferred dimension.
Conformed Dimensions
A dimension that is used in multiple locations is called a conformed dimension. A conformed dimension may be used with multiple fact tables in a single database, or across multiple data marts or data warehouses.
Degenerate Dimensions
A degenerate dimension is when the dimension attribute is stored as part of fact table, and not in a separate dimension table. These are essentially dimension keys for which there are no other attributes. In a data warehouse, these are often used as the result of a drill through query to analyze the source of an aggregated number in a report. You can use these values to trace back to transactions in the OLTP system.
Role Playing Dimensions
A role-playing dimension is one where the same dimension key — along with its associated attributes — can be joined to more than one foreign key in the fact table. For example, a fact table may include foreign keys for both ship date and delivery date. But the same date dimension attributes apply to each foreign key, so you can join the same dimension table to both foreign keys. Here the date dimension is taking multiple roles to map ship date as well as delivery date, and hence the name of role playing dimension.
Shrunken Dimensions
A shrunken dimension is a subset of another dimension. For example, the orders fact table may include a foreign key for product, but the target fact table may include a foreign key only for productcategory, which is in the product table, but much less granular. Creating a smaller dimension table, with productcategory as its primary key, is one way of dealing with this situation of heterogeneous grain. If the product dimension is snowflaked, there is probably already a separate table for productcategory, which can serve as the shrunken dimension.
Static Dimensions
Static dimensions are not extracted from the original data source, but are created within the context of the data warehouse. A static dimension can be loaded manually — for example with status codes — or it can be generated by a procedure, such as a date or time dimension.




Data Warehousing



Lot of my friends and reader asking me to write a tutorial related to Microsoft BI tools. As I personally feel that the Data Warehousing is not just understand or practice via some Tools provided by Microsoft, it need deep understanding analysing with data. Well we can learn the tools very easily but sensing the data and information is quite tough to learn. It’s growing with maturity and hard work. 

Well if readers want me to write something, here I am trying to give them something by my article.

In this article I am trying to understand the concept behind data ware housing. Why we all think about it.

What is the Data Warehousing?

One of the main features of data warehousing is to combining data from heterogeneous data sources into one comprehensive and easily maintained database.
The common accessing systems of data warehousing includes 

 Queries

Analysis

Reporting

As the number of source can be anything, the data warehouse creates one database at the end. The final result however, is homogeneous data, which can be more easily manipulated.

Data warehousing is commonly used by companies to analyse trends over time. Its primary function is facilitating strategic planning resulting from long-term data overviews. From such overviews, business models, forecasts, and other reports and projections can be made. Routinely, because the data stored in data warehouses is intended to provide more overview-like reporting, the data is read-only. If you want to update the data stored via data warehousing, you'll need to build a new QUERY when you're done.

We are not saying that data warehousing involves data that is never updated. On the contrary, the data stored in data warehouses is updated all the time. It's the reporting and the analysis that take more of a long-term view.

Data warehousing is not the be-all and end-all for storing all of a company's data. Rather, data warehousing is used to house the necessary data for specific analysis. More comprehensive data requires different capacities that are more static and less easily manipulated than those used for data warehousing.

Data warehousing is typically used by larger companies analysing larger sets of data for enterprise purposes.

Smaller companies wishing to analyse just one subject, for example, usually access data marts, which are much more specific and targeted in their storage and reporting. Data warehousing often includes smaller amounts of data grouped into data marts. In this way, a larger company might have at its disposal both data warehousing and data marts, allowing users to choose the source and functionality depending on current needs.

Hope you like it. In my next session I am directly jump over Microsoft BI tools Introduction and try to discuss when you used them.

Hope you like it.

Data Warehouse Concepts

Paths can be many but Destination is One. With this I mean to say, Different people have different explanations for a Data warehouse. But all the explanations and definitions aims to single destination i.e. Data storage. If you are thinking of aligning your career in BI/DW (Business Intelligence/Data warehousing), than you definitely need to know two names. I call these two persons as “Gods of BIDW”.
First name is Ralph Kimball and second name is Bill Inmon. You must be wondering why we need to know about these two personalities. This is because they have given some important Data warehouse concepts which need to be followed.
  • Ralph Kimball provided a “Bottom – Up approach”, whereas
  • Bill Inmon provided “Top – Down approach”.
In the world of BIDW, the most popular definition came from Bill Inmon, who said “A Data warehouse is a Subject-oriented, Integrated, Time-variant and Non-volatile collection of Data in support of Management’s decision making process”.
  1. Subject-Oriented : It means a Data warehouse can be used to analyze a specific subject area. For example – In an Organization, Subject area can be HR, Finance, Sales, etc. So, A Data warehouse can be used to analyze particularly HR or Sales.
  2. Integrated : With Integrated we mean data from heterogeneous data sources is integrated in Data warehouse. For example – Source A stores date in different format and source B stores date in other format, but in a Data warehouse, dates will be stored in a single format only.
  3. Time-Variant : Data warehouse only deals with historical data. It acts as a pit where Transactional data is dumped every day. We can retrieve data which is 6 months, 12 months old or even more than this from a data warehouse. For example – You hold a SBI bank account and updated your contact details. So, Transaction system can only provide you the current contact details whereas Data warehouse holds all the contact details associated with your account (Current and Previous both).
  4. Non-volatile : As explained above, Data warehouse is freakily mad about historical data. Once data comes in the data warehouse, it will not alter. Hence, historical data in a data warehouse is Non-volatile.
Another concept came from Ralph Kimball who explained Data warehouse in his own way and said “A data warehouse is a copy of transactional data specifically structured for query and analysis”.
This is a more like a functional view of a Data warehouse. Bill Inmon precisely stated how the Data warehouse is built which was lacking in Ralph Kimball’s definition.
Data warehouse usages –
  • Trend analyzing for an organization.
  • Startegic planning.
  • Product forecasting.
  • Designing business models.
  • Preparing dashboards and reports with reporting tools.
Summary – In this post, we have learnt
  • What do we mean by Data warehouse.
  • Definition provided by Bill Inmon.
  • Definition provided by Ralph Kimball.
  • How Transactional database is different from Data warehouse.
  • Data warehouse usages.

Introduction to Data warehousing

Have you ever thought why GoogleFacebookTwitter and many other websites are position at the top rank? Have you ever thought how Wall mart, Reliance and other shopping marts are making profits? Have you ever analyzed the ways ebay, flipkart, Snapdeal and other online stores follows to achieve higher revenue each year? There is a single answer for all these questions and it is Data warehousing and Business Intelligence. With this you must get to know the power of this word “DATA”. Data warehousing provides the enterprise with Memory and Data Mining provides the enterprise with Intelligence.
“Data warehousing is a concept and it provides an architectural model for the flow of data from Operational data System to Decision support system.”
Data utilization
Earlier, we were dependent only on Transactional Processing System (TPS) also known as Real-time processing. TPS was used for business transactions to collect daily transactional data. All day-to-day transactions were captured and stored into the database. Then, the point of concern for Business people or Decision makers was to have high network bandwidth to send data to distributed platforms on daily basis and Business Intelligence was given low priority. But until what time can the need of Business intelligence and Data warehousing be ignored.
In 1970’s, Business experts sensed the need of Data warehousing and Business Intelligence and that sense has now become the essence of each organization. With the analysis of the data stored at the Data warehouse, Decision makers can view the immediate impact on the bottom line and top lineof the organization. Also, it provides greater insight of customers which helps the organization to retain them and gives better visibility of the organization.

Need for Data warehousing

Today’s mantra of success is “Proactive use of Information than Optimization”. This is a world of competition and organizations start focusing on utilizing data they collects from day-to-day transactions. Organizations realize the Importance of information they hold, to improve decision making and achieve profits. But, Operational systems were never designed to perform these intelligent activities and probably never can be. So, this arouses a need of such a system which is mature enough to support decision makers and analyze business process. As said by Plato (a Greek philosopher) – “Necessity is the mother of Invention”, this need leads to the development of a system known as Decision Support System (DSS) i.e. Data warehouse. Let’s gain more information on these two systems i.e. TPS and DSS.
Transaction-processing systems are built to capture day to day information and update that collectected information quickly. The TPS is constantly changing and is generally online 24 hours a day.
Transaction Processing System Examples – In shopping outlets, Point-of-sales scans your order and generates bill, Railway reservation applications, Order placing at online stores like flipkart, myntra, etc.
Decision-support systems are built to analyze the data stored in data warehouse. It deals with historical data and helps in extracting the information easily and quickly.
Decision Support System Examples – Shopping outlets analyze which product is to be kept with which product so as to gain more sales and generate maximum revenue, Information for marketing, Claiming Insurance, LIC policies.
Decision makers in Data warehousing

Problem with Data warehousing

You must have read a line from a great poet, William Shakespeare – “Water, water, everywhere nor any drop to drink”. It means in sea there is water all around but you cannot drink that saline water. In the same way, Data warehouse contains volume of data but part of it is important to you that you have to look.
  • I can’t find the data I need – Data is scattered over the network.
  • I can’t get the data I need – Need an expert to get the data.
  • I can’t understand the data I found – Available data is poorly documented.
  • I can’t use the data I found – Data need to be transformed from one form to another.
Apart from these issues, Cost is the major drawback because Data warehousing is a very costly process and requires huge amount of data for Analysis and Reporting processes.

Benefits of Data warehousing

Data warehousing is a concept of merging data from heterogeneous data sources into one database. Organizations spend vast resources in terms of human resource as well as financial resource forimplementing Data warehouse. Why the companies invest so much into this? What is the benefit of spending resources on storing data? Let’s see the benefits provided by Data warehousing to the organization.
Benefits and Advantages of Datawarehousing
  • It facilitates Strategic Planning and considered as a backbone of organizations. Business People get to know market’s trend and can make better decisions.
  • It provides an option of Better Data Analysis as Data from varied sources is grouped into Data marts. Companies can analyze data in a better way by querying on multiple dimensions (Multi Dimensional Query).
  • It provides flexibility in Operation Systems as it operates separately from other operation processes, Data retrieval can be done with ease without affecting other operation requirements.
  • It helps in identifying and resolving dirty data, with dirty data I mean to say Inconsistent data. All the dirty/inconsistent data is resolved and a homogenous data is then grouped into Database. As a result, Data Retrieval and Data Analysis process gets simplified.
  • It helps to improve Decision making process as it provides market trends and forecast.
  • It is used to enhance Customer service and provide better enterprise intelligence.
Free e-books and Pdf’s :-
Please Subscribe or Like PhpRinG Tutorials for free e-books and pdf’s.
  • Launching Very Soon free e-books and pdf’s on Data warehousing.
I hope you like our post on Data warehousing Concepts. Please provide feedback in the form of comments below. Stay tune to PhpRinG Tutorials for more tutorials on Data warehousing in future!!!

Schema & Types of Schemas

Schema


       A schema is a collection of objects such as tables, views, indexes, synonyms. It simply explains the logically diagram of Database and relation between tables by using key attributes. It includes the name and description of records of all record types including all associated data-items and aggregates.

There are four types of schemas are available in data warehouse.

Star Schema
Snow Flake Schema
Extended Star Schema
Galaxy Schema

Star Schema with Diagram

Star Schema
·   In Star Schema all dimensional tables are related with One Fact table.
·   The dimension table contains the set of attributes. Each Dimension table has a one to many relationship with Fact table.
·   Simple Star Schema contains only one Fact Table where as a complex star schema contains more than one Fact table.
·   Star schemas are denormalized that reason it retrieve data faster than highly normalized schemas.


     The above diagram explains Simple Star Schema. There is a fact table at the center of location and it contains the keys to four dimensions.


Snow Flake Schema with Diagram

Snow Flake Schema
·      In snow Flake schema also same as star schema like all dimension tables related with Fact table and some of the Dimension table linked with Dimension tables only.
·      One or more dimension tables are partially or completely normalized in Snowflake schema.
·      Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table.
·      The main Principle of snowflake is normalization of the dimension tables by removing redundancy of attributes and creating separate tables.





                The above diagram explains that the dimension tables of Customer, Time, Product and Geography are connected to Fact table of FactInternetSales , Product Dimension table related with another dimension table ProductSubcategory.


Extended Star Schema with Diagram

Extended Star Schema

·         In Star Schema Fact tables are connected to Dimension tables only. But Extended Star Schema Fact table is connected to Dimension table and the Dimension table is connected to SID (Surrogate ID) table and SID table is connected to the master data tables.
·         One Fact table connected to 16 Dimension tables, one Dimension table can be assigned with maximum of 248 SID Tables in Extended Star schema.
·         Main Advantage of Extended Star schema is keeping the master data outside of dimension tables , we are able to share the master data across the system there by reducing the need of redundancy of master table.

·         It is mainly used in SAP BW projects for fast loading of data and fast access to reports.


Galaxy Schema with Diagram

Galaxy Schema
·         The Galaxy schema is also known as fact constellation schema. It is combination of star schema and snow flake schema.
·         Galaxy schema contains many fact tables with some common dimension (Confirmed Dimensions)
·         This Schema is a combination of many data marts. One fact table may represent one data mart.