Mastering Data Warehouses and Unlocking the Power of Cloud-Based Solutions

Dr. Arun Kumar Pandey (Ph.D.)
8 min readJun 12, 2023

--

A database is a structured collection of data that is organized and stored in a computer system. It can be used to store and manage various types of data, such as customer information, sales data, inventory records, and much more.

Database management involves the tasks of creating, maintaining, securing, and optimizing a database to ensure that it can be used effectively and efficiently. This typically involves the use of specialized software known as a database management system (DBMS) that provides tools and features to manage the database.

There are different types of DBMSs, including:

  1. Relational DBMS — organizes data in tables with rows and columns, and uses SQL to manipulate the data. Examples: MySQL, Oracle Database, Microsoft SQL Server, PostgreSQL, SQLite, BigQuery
  2. NoSQL DBMS — used for managing unstructured data such as social media posts, multimedia content, and documents. Examples: MongoDB, Cassandra, Redis.
  3. Object-Oriented Databases (OODBMS): Object-oriented databases store objects directly, preserving their relationships and behaviors. They are suitable for object-oriented programming languages and applications that heavily rely on object-oriented concepts. OODBMSs provide features like encapsulation, polymorphism, and inheritance. Examples include ObjectDB and ObjectStore.
  4. Hierarchical Databases: Hierarchical databases organize data in a tree-like structure, where each record has a parent-child relationship. These databases were popular in early mainframe systems but are less prevalent today. IBM’s Information Management System (IMS) is an example of a hierarchical database.
  5. Network Databases: Network databases are similar to hierarchical databases, but they allow more complex relationships between records. They use a network model, where records can have multiple parent and child relationships. The CODASYL (Conference on Data Systems Languages) database management system is an example of a network database.
  6. Distributed Databases: Distributed databases store data across multiple computers or servers, connected by a network. They offer advantages like high availability, fault tolerance, and scalability. Distributed databases are commonly used in large-scale applications and systems that require data replication and distribution across multiple locations.
  7. Data Warehouses: A data warehouse is a centralized repository that integrates and stores large volumes of structured, historical, and potentially heterogeneous data from various sources within an organization. It is designed to support the analysis and reporting needs of business intelligence and decision-making processes. A data warehouse provides a unified and consistent view of data, making it easier to extract valuable insights, identify trends, and make informed business decisions.

Data warehouse

A data warehouse is an electronic system that gathers data from a wide range of sources within a company and uses the data to support management decision-making.

In general data warehouse's basic structure is shown in the following figure:

Image credit: https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/
  • The basic structure lets end users of the warehouse directly access summary data derived from source systems and perform analysis, reporting, and mining on that data. This structure is useful when data sources derive from the same types of database systems. It only consists of data sources, Warehouse, and users.
  • A warehouse with a staging area is the next logical step in an organization with disparate data sources with many different types and formats of data. The staging area converts the data into a summarized structured format that is easier to query with analysis and reporting tools. In this case, other than the Data marts, all other stages are involved.
  • A variation on the staging structure is the addition of data marts to the data warehouse. The data marts store summarized data for a particular line of business, making that data easily accessible for specific forms of analysis. For example, adding data marts can allow a financial analyst to easily perform detailed queries on sales data, to make predictions about customer behavior. Data marts make analysis easier by tailoring data specifically to meet the needs of the end user. In this case, the figure shown above is valid.

Types of Warehouses or Data Warehouse Models

There are different types of data warehouses based on their architecture and purpose. Some common examples include:

  • Enterprise Data Warehouse (EDW): An EDW serves as the central repository for all organizational data, integrating data from various sources across the entire enterprise. It provides a comprehensive view of the business and supports enterprise-wide reporting and analysis.
  • Operational Data Store (ODS): An ODS is a database that contains real-time or near-real-time data from multiple operational systems. It acts as an intermediate layer between operational systems and the data warehouse, providing more immediate access to data for operational reporting and monitoring.
  • Data Mart: A data mart is a subset of a data warehouse focused on a specific business function, department, or user group. It contains a tailored set of data and is designed to address the specific analytical needs of a particular business area, such as sales, marketing, or finance.
  • Virtual Data Warehouse (VDW): A VDW is a concept where data from different sources is virtually integrated and presented as a unified view without physically storing the data in a single location. It provides a logical abstraction layer for querying and analysis, enabling organizations to leverage data from multiple sources without the need for physical consolidation.

Architecture

Data warehouses typically follow a specific architecture that involves three main components:

  • the data sources,
  • the ETL (Extract, Transform, Load) process, and
  • the data warehouse itself.
(Image credit: https://corporatefinanceinstitute.com/resources/business-intelligence/data-warehousing/)

The data sources can include transactional databases, operational systems, external data feeds, spreadsheets, and other structured data sources. The ETL process extracts the data from these sources, applies transformations to clean, integrate, and format it, and then loads it into the data warehouse. The data warehouse itself is optimized for querying and analysis, with structured schemas, indexes, and other optimizations in place to facilitate efficient data retrieval.

Three-Tier Architecture:

Traditional data warehouse architecture employs a three-tier structure composed of the following tiers.

  1. Bottom tier: This tier contains the database server used to extract data from many different sources, such as from transactional databases used for front-end applications.
  2. Middle tier: The middle tier houses an OLAP server, which transforms the data into a structure better suited for analysis and complex querying. The OLAP server can work in two ways: either as an extended relational database management system that maps the operations on multidimensional data to standard relational operations(Relational OLAP), or using a multidimensional OLAP model that directly implements the multidimensional data and operations.
  3. Top tier: The top tier is the client layer. This tier holds the tools used for high-level data analysis, querying reporting, and data mining.
(Image credit: https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/)

Ways to structure data warehouse

There are several ways to structure a data warehouse, depending on the specific needs and requirements of the organization. Here are some common approaches to structuring a data warehouse:

  1. Star Schema: The star schema is a widely used structure in data warehousing. It consists of a central fact table surrounded by multiple dimension tables. The fact table contains the measurements or facts of the business, such as sales or revenue, and the dimension tables provide context to these facts. Each dimension table represents a different aspect of the business, such as time, product, customer, or location. The star schema offers simplicity, ease of understanding, and efficient query performance.
  2. Snowflake Schema: The snowflake schema is an extension of the star schema. It expands on the dimension tables by normalizing them into multiple levels. In a snowflake schema, dimension tables are connected through hierarchical relationships, resulting in a more complex structure. This schema can be useful when dealing with dimensions that have many attributes and hierarchies. However, it may require more complex query joins and can potentially impact performance.
  3. Fact Constellation (Galaxy) Schema: The fact constellation schema, also known as the galaxy schema, is a more complex structure that consists of multiple fact tables sharing dimension tables. It is suitable when there are multiple fact tables representing different business processes or areas, but they share common dimensions. The fact constellation schema offers greater flexibility and can support complex analysis involving multiple fact tables.
  4. Data Vault: The Data Vault methodology is a modeling technique that focuses on scalability, flexibility, and historical tracking of data. It involves separating the data into three main types of tables: the hub tables representing core business entities, the satellite tables containing descriptive information about the hubs, and the link tables that capture the relationships between hubs. Data Vault structures are designed to handle large volumes of data, accommodate changes over time, and provide traceability of data.
  5. Hybrid Approaches: In practice, many data warehouses use a combination of different structures to meet specific needs. This can involve a mix of star schemas, snowflake schemas, and other modeling techniques. Hybrid approaches allow organizations to strike a balance between simplicity, performance, and flexibility by adopting different structures based on the specific requirements of different parts of the data warehouse.

The choice of data warehouse structure depends on factors such as the complexity of the data, the analytical requirements, the organization’s reporting needs, and the scalability and performance considerations. It’s important to carefully analyze the data and business requirements before determining the appropriate structure for a data warehouse.

New Data Warehouse Architectures

In recent years, new data warehouse architectures have emerged to address the evolving needs of organizations and the increasing volume, variety, and velocity of data. Here are a few notable new data warehouse architectures:

  1. Cloud Data Warehouse: A scalable and flexible data warehouse architecture that leverages cloud computing infrastructure. Example: Google BigQuery, Snowflake, Amazon Redshift
  2. Data Lake: A repository for storing raw and unstructured data in its native format. Examples: Apache Hadoop, Amazon S3, Azure Data Lake Storage.
  3. Real-Time Data Warehouse: Architecture that enables near real-time or real-time data ingestion and processing for immediate analysis. Example: Confluent Platform, Apache Kafka, MemSQL.
  4. Hybrid Data Warehouse: Combines on-premises and cloud-based data warehousing, providing flexibility and scalability. Example: Google Cloud BigQuery Omni, Microsoft Azure Synapse Analytics
  5. Logical Data Warehouse: An abstraction layer that provides a unified view of data from multiple sources without physical consolidation. Example: Denodo Platform, Apache Calcite, SAP HANA Virtual Data Model.

Advantages

In general, cloud-based DBMS offers several advantages over traditional on-premise DBMS, such as:

  • Scalability: Cloud-based DBMS can easily scale up or down depending on the needs of the business. This allows for cost-effective data management and storage.
  • Accessibility: Cloud-based DBMS can be accessed from anywhere with an internet connection, which makes it easy for multiple users to collaborate and work on the same data.
  • Security: Cloud-based DBMS typically have built-in security features that protect the data from unauthorized access, such as encryption and user authentication.
  • Cost-effectiveness: Cloud-based DBMS can be more cost-effective than traditional on-premises solutions since there is no need to invest in expensive hardware and infrastructure.
  • Performance: Cloud-based DBMS can offer high performance and low latency, allowing for real-time processing and analysis of large amounts of data.

Reference

  1. Check my GitHub repository: https://github.com/arunp77/SQL/tree/main
  2. https://www.javatpoint.com/sql-tutorial
  3. https://www.youtube.com/watch?v=l8DCPaHc5TQ&ab_channel=Chandoo
  4. https://www.youtube.com/watch?v=HXV3zeQKqGY&list=WL&index=53&t=9458s&ab_channel=freeCodeCamp.org

--

--