The Data Lakehouse – Simple, Flexible, and Cost Efficient

Published June 9, 2022

According to the International Data Corporation (IDC), the amount of data created over the next three years will exceed the data created over the past 30 years. The Data Lakehouse is a new architecture that combines features of the two data management tools that preceded it: the Enterprise Data Warehouse and the Data Lake. To understand what the Data Lakehouse is and what issues it solves compared to its predecessors, we must first look at its historical context.

Enterprise Data Warehouses

As the volume of data in transactional systems grew, there was a need to collect and synthesize the data into central warehouses where it could be used for business intelligence (BI) and decision making. Data Warehouses have numerous advantages and have evolved tremendously over the past three to four decades. They are easy to use, highly reliable, very performant, and SQL friendly. In addition, Data Warehouses enforce a strict, predefined schema, are ACID transaction compliant, and are well optimized for BI reporting.

Over time, the data landscape evolved again. The volume and variety of data collected grew exponentially, opening avenues for innovative use cases that the traditional Data Warehouse was never designed to support. The majority of collected data is now unstructured and/or semi-structured, while the traditional DWH supports only structured, schema-defined data. The inability to store unstructured data is a serious limitation, because that is precisely the data used for Machine Learning (ML) and Data Science. Another limitation was that the compute and storage layers of the traditional DWH were coupled together (especially in on-prem setups); you therefore had to provision resources for your peak load, no matter how brief that peak, which is not the best use of resources.

Data Lakes

These challenges brought to the surface another type of solution: the Data Lake. Data Lakes are repositories that store data in any format and can be used for ML and Data Science. The model began with the rise of Apache Hadoop, which leveraged the Hadoop Distributed File System (HDFS) to store data of every variety and format. Later, HDFS was largely replaced by object storage services such as Amazon S3, which are cheaper, highly durable, and highly available (a minimal sketch of the pattern follows the list below). However, while Data Lakes solved some issues, they presented various drawbacks:

  • No SQL interface
  • No direct support for BI
  • Continued need to ETL curated subsets of the data into a DWH
  • Lack of sophisticated security and data quality enforcement
  • Lack of robust data governance capabilities on object storage
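
To make the classic lake pattern concrete, here is a minimal PySpark sketch, assuming a hypothetical bucket name (my-data-lake) and a Spark session already configured with S3 credentials; it illustrates the pattern, not a production pipeline:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lake-demo").getOrCreate()

    # Semi-structured events land in object storage as raw JSON and are
    # rewritten as Parquet for cheaper scans. Note what is missing: nothing
    # enforces a schema, guards against half-finished writes, or exposes
    # a SQL catalog over these files.
    events = spark.read.json("s3a://my-data-lake/raw/events/2022/06/")
    events.write.mode("append").parquet("s3a://my-data-lake/curated/events/")

Unstructured files (images, logs, model artifacts) can sit in the same bucket untouched, which is exactly what makes the lake flexible and, without a transaction or governance layer, hard to manage.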

The common approach then was to build a complex analytics ecosystem with a Data Lake at the center feeding various use-case-specific DWHs. While this tactic aimed to capture the benefits of both worlds, it introduced immense complexity in maintaining all of the systems, and it resulted in two separate copies of the data, incompatible interfaces (SQL, Python, etc.), and incompatible security and governance models.

Data Lakehouse

With the evolution of technology, the need for a new, innovative, and relevant architecture emerged. As a result, data professionals created the Data Lakehouse. A Data Lakehouse has the following features:

  • Stores data in multiple formats
  • Infers a schema-like structure over the data and allows for data management similar to a DWH
  • Provides ACID transaction support, upserts and deletes, schema enforcement and evolution, file compaction, unification of batch and streaming data, governance and cataloging, and enhanced security through open-source frameworks such as Delta Lake (created by Databricks), Apache Iceberg (created by Netflix), and Apache Hudi (created by Uber); see the sketch after this list
  • Reduces the ETL processes needed to create curated datasets
  • Enables analysts to write SQL queries against raw data for ad hoc reporting
  • Connects to popular BI tools
  • Lets Machine Learning tools read files directly from the Lakehouse
  • Separates storage from compute, enhancing the flexibility of using multiple compute engines
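
As a concrete illustration of several of these features, here is a minimal PySpark sketch using the open-source Delta Lake format (via the delta-spark package); the bucket, table path, and column names are hypothetical:

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = (
        SparkSession.builder.appName("lakehouse-demo")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # New records arrive as raw JSON in the lake.
    updates = spark.read.json("s3a://my-data-lake/raw/customers/2022/06/09/")

    # ACID upsert (MERGE): update matching rows, insert new ones. Schema
    # enforcement rejects writes that do not match the table's schema.
    customers = DeltaTable.forPath(spark, "s3a://my-data-lake/delta/customers")
    (
        customers.alias("t")
        .merge(updates.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

    # Analysts can run plain SQL over the same files, with no copy into
    # a separate DWH, and ML tools can read the table directly.
    spark.sql(
        "SELECT country, COUNT(*) AS customers "
        "FROM delta.`s3a://my-data-lake/delta/customers` GROUP BY country"
    ).show()

Because the table is just open-format files plus a transaction log in object storage, any compute engine that speaks Delta can query it, which is what the separation of storage and compute buys you.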

Modern data analytics frameworks break down silos, increase collaboration and data literacy, scale almost without limit, and ultimately decrease costs (development time, outages, incorrect data) while increasing ROI. Gone are the days of static dashboards built on OLAP sources and traditional reports for descriptive analytics. Now we have the modern Lakehouse architecture as the foundation to enable predictive and prescriptive analytics through Data Science and ML.

Conclusion

The nascent Lakehouse, an all-in-one framework to ingest, catalog, and curate data and to support both traditional and ML workflows from streaming and batch sources, is not just a new fad; it will be a game-changer for companies across all industries. At Infinitive, we have the know-how to help your organization “get the value out of your data.” For more information on how to implement a modern Data Lakehouse architecture in your business, contact us today.

Tihomir Cheresharski

Sales Engineer