Python: What is Datawarehouse?

A data warehouse is a relational/multidimensional database that is designed for query and analysis rather than transaction processing. A data warehouse usually contains historical data that is derived from transaction data. It separates analysis workload from transaction workload and enables a business to consolidate data from several sources.

In addition to a relational/multidimensional database, a data warehouse environment often consists of an ETL solution, an OLAP engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

There are three types of data warehouses:
1. Enterprise Data Warehouse(EDW) - An enterprise data warehouse provides a central database for decision support throughout the enterprise.
2. ODS(Operational Data Store) - This has a broad enterprise wide scope, but unlike the real entertprise data warehouse, data is refreshed in near real time and used for routine business activity.
3. Data Mart - Datamart is a subset of data warehouse and it supports a particular region, business unit or business function.

Data warehouses and data marts are built on dimensional data modeling where fact tables are connected with dimension tables. This is most useful for users to access data since a database can be visualized as a cube of several dimensions. A data warehouse provides an opportunity for slicing and dicing that cube along each of its dimensions.

Data Mart: A data mart is a subset of data warehouse that is designed for a particular line of business, such as sales, marketing, or finance. In a dependent data mart, data can be derived from an enterprise-wide data warehouse. In an independent data mart, data can be collected directly from sources.

What is Star Schema?
Star Schema is a relational database schema for representing multimensional data. It is the simplest form of data warehouse schema that contains one or more dimensions and fact tables. It is called a star schema because the entity-relationship diagram between dimensions and fact tables resembles a star where one fact table is connected to multiple dimensions. The center of the star schema consists of a large fact table and it points towards the dimension tables. The advantage of star schema are slicing down, performance increase and easy understanding of data.

Steps in designing Star Schema
Identify a business process for analysis(like sales).
Identify measures or facts (sales dollar).
Identify dimensions for facts(product dimension, location dimension, time dimension, organization dimension).
List the columns that describe each dimension.(region name, branch name, region name).
Determine the lowest level of summary in a fact table(sales dollar).
Important aspects of Star Schema & Snow Flake Schema
In a star schema every dimension will have a primary key.
In a star schema, a dimension table will not have any parent table.
Whereas in a snow flake schema, a dimension table will have one or more parent tables.
Hierarchies for the dimensions are stored in the dimensional table itself in star schema.
Whereas hierachies are broken into separate tables in snow flake schema. These hierachies helps to drill down the data from topmost hierachies to the lowermost hierarchies.

Hierarchy
A logical structure that uses ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation; for example, in a time dimension, a hierarchy might be used to aggregate data from the Month level to the Quarter level, from the Quarter level to the Year level. A hierarchy can also be used to define a navigational drill path, regardless of whether the levels in the hierarchy represent aggregated totals or not.

Level
A position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the Month, Quarter, and Year levels.

Fact Table
A table in a star schema that contains facts and connected to dimensions. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key that is made up of all of its foreign keys.

A fact table might contain either detail level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). A fact table usually contains facts with the same level of aggregation.

Example of Star Schema:

In the example figure, sales fact table is connected to dimensions location, product, time and organization. It shows that data can be sliced across all dimensions and again it is possible for the data to be aggregated across multiple dimensions. "Sales Dollar" in sales fact table can be calculated across all dimensions independently or in a combined manner which is explained below.

Sales Dollar value for a particular product
Sales Dollar value for a product in a location
Sales Dollar value for a product in a year within a location
Sales Dollar value for a product in a year within a location sold or serviced by an employee.

Snowflake Schema
A snowflake schema is a term that describes a star schema structure normalized through the use of outrigger tables. i.e dimension table hierachies are broken into simpler tables. In star schema example we had 4 dimensions like location, product, time, organization and a fact table(sales).

In Snowflake schema, the example diagram shown below has 4 dimension tables, 4 lookup tables and 1 fact table. The reason is that hierarchies(category, branch, state, and month) are being broken out of the dimension tables(PRODUCT, ORGANIZATION, LOCATION, and TIME) respectively and shown separately. In OLAP, this Snowflake schema approach increases the number of joins and poor performance in retrieval of data. In few organizations, they try to normalize the dimension tables to save space. Since dimension tables hold less space, Snowflake schema approach may be avoided.

Python

What is Datawarehouse?

1 comment:

Posts