Best Practices for Data Lakehouse Ingestion

Albert Wong
6 min read · Aug 12, 2024


A data lakehouse combines the best of both worlds: the scalability and cost-efficiency of a data lake with the reliability and governance of a data warehouse. Effective data ingestion is crucial to maximizing the value of your data lakehouse. In this blog post, we’ll explore key best practices for a successful ingestion process.

Understanding Your Data

Before diving into ingestion, it’s essential to have a clear understanding of your data:

  • Data sources: Identify all relevant data sources, including databases, applications, IoT devices, and cloud-based platforms.
  • Data volume and velocity: Assess the volume and speed of data to determine the appropriate ingestion method.
  • Data quality: Define data quality standards and implement validation checks during ingestion.
  • Data schema: Understand the structure of your data to ensure proper mapping and transformation.
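
Before wiring up any pipeline, it helps to pull a small sample from each source and check its schema and rough row counts. Below is a minimal Spark sketch in Java for that kind of inspection; the sample path and local master setting are hypothetical placeholders, not part of any specific pipeline.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class InspectSource {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("inspect-source")
        .master("local[*]")   // local mode is fine for one-off exploration
        .getOrCreate();

    // Hypothetical sample exported from one of the source systems.
    Dataset<Row> sample = spark.read().parquet("s3a://landing/orders_sample/");

    sample.printSchema();                            // data schema: structure and types
    System.out.println("rows = " + sample.count());  // rough feel for volume
    sample.describe().show();                        // quick profile of numeric columns
    spark.stop();
  }
}
```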

Albert’s comments

Focus on KISS (keep it simple).

  • Generally speaking, ingestion solutions based on Debezium and Apache Spark are going to be the cheapest. The trade-off is that they are all OSS projects, and you have to write the glue code yourself (a sketch of that glue code follows this list). Vendors like onehouse.ai provide a managed Debezium and Apache Spark based ingestion architecture that keeps costs low and ingestion speeds fast.
  • The unknowns are database-based ingestion (Trino, ClickHouse, StarRocks) and ETL/ELT solutions (Airbyte). Some of these are optimized for incremental ingestion, while others have a very poor ingestion speed-to-cost ratio.
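
To make the "glue code" point concrete, here is a minimal sketch of a Spark Structured Streaming job in Java that reads Debezium change events from Kafka and keeps the post-change image of each row. It assumes Debezium's JSON converter with schemas disabled, the spark-sql-kafka connector on the classpath, and hypothetical broker, topic, and field names.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

public class DebeziumKafkaGlue {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("debezium-glue").getOrCreate();

    // Shape of the Debezium envelope for a hypothetical orders table.
    StructType rowSchema = new StructType()
        .add("order_id", "long").add("email", "string").add("amount", "double");
    StructType envelope = new StructType()
        .add("before", rowSchema).add("after", rowSchema)
        .add("op", "string").add("ts_ms", "long");

    Dataset<Row> changes = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")   // hypothetical broker
        .option("subscribe", "dbserver1.public.orders")    // hypothetical Debezium topic
        .option("startingOffsets", "earliest")
        .load()
        .select(from_json(col("value").cast("string"), envelope).alias("c"))
        .select(col("c.op"), col("c.ts_ms"), col("c.after.*"));  // keep the post-image for upserts

    // For a first test, print the parsed change stream; a real job writes to the lakehouse instead.
    changes.writeStream()
        .format("console")
        .option("checkpointLocation", "/tmp/chk/debezium-glue")
        .start()
        .awaitTermination();
  }
}
```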

Choosing the Right Ingestion Method

The optimal ingestion method depends on your data characteristics and business requirements:

  • Batch ingestion: Suitable for large, static datasets that don’t require real-time processing.
  • Stream ingestion: Ideal for high-velocity, real-time data streams.
  • Change Data Capture (CDC): Efficiently captures and delivers data changes from transactional databases.

Albert’s comments

Open table formats matter. Use a kappa data architecture instead of a lambda data architecture so that batch and stream ingestion go through a single path. For CDC, the best solution is Debezium feeding into that kappa architecture.

  • Apache Iceberg: Not really performant for ingestion in general, because its merge-on-read support is still maturing (and copy-on-write isn’t designed for incremental ingestion). However, if ingestion latency isn’t a big deal, just pay the cost and move on. Common paths are Debezium to Kafka to a Kafka sink, Debezium to Kafka to SQL inserts, file to Spark job, or file to an application/tool.
  • Delta Lake: The COW and MOR capabilities are very mature and performant. Common paths are Debezium to Kafka to Spark job, or file to Spark job.
  • Apache Hudi: The COW and MOR capabilities are very mature and performant. Common paths are Debezium to Kafka to Spark job, or file to Spark job (a sketch of the file-to-Spark-job path follows this list). Hudi also provides a specialized ingestion tool called Hudi Streamer (formerly DeltaStreamer), which is the easiest and most feature-rich option compared with the Hudi Kafka Connect sink.
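
As a hedged illustration of the file-to-Spark-job path, here is a sketch of a batch Spark job in Java that upserts a landed extract into a Hudi table. The paths, key fields, and table name are hypothetical; the option keys are standard Hudi write configs, but exact names can shift between Hudi versions, and the hudi-spark bundle must be on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class FileToHudiJob {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("file-to-hudi")
        // Hudi relies on Kryo serialization; the hudi-spark bundle must be on the classpath.
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate();

    // Hypothetical daily extract landed as Parquet files.
    Dataset<Row> orders = spark.read().parquet("s3a://landing/orders/2024-08-12/");

    orders.write()
        .format("hudi")
        .option("hoodie.table.name", "orders")
        .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")     // COW is the other choice
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "order_id")     // record key
        .option("hoodie.datasource.write.precombine.field", "updated_at")  // latest version wins
        .option("hoodie.datasource.write.partitionpath.field", "order_date")
        .mode(SaveMode.Append)
        .save("s3a://lakehouse/orders");

    spark.stop();
  }
}
```

The Debezium-to-Kafka-to-Spark path swaps the batch read above for the streaming read shown earlier; a Delta Lake write looks similar with format("delta"), though upserts there go through a MERGE statement.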

Commercial vendors that specialize in data lakehouse ingestion include onehouse.ai, Fivetran, and Upsolver, since they don’t just do ingestion but also perform table optimization services (e.g., fixing data skew across files and partitions).

Data Transformation and Enrichment

Transform and enrich data during ingestion to improve its value:

  • Data cleaning: Remove duplicates, inconsistencies, and errors.
  • Data standardization: Apply consistent formats and units.
  • Data enrichment: Add context or additional information.
  • Data partitioning: Optimize query performance by partitioning data based on relevant attributes.

Albert’s comments

Most projects/vendors just do ELT, which means landing the data in S3 and then transforming it. Another common pattern is to land the data and then use Airflow or Dagster to chain a series of transformations, or to execute your transformation with code in Spark (a sketch follows) or configuration in Kafka. Among vendors, onehouse.ai allows transformations during ingest by coding in Java; during my research, I did not come across others who provide a solution for this problem.
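
For the do-it-yourself route, transformations typically run as a Spark job between the landing zone and the curated table. A minimal Java sketch covering cleaning, standardization, enrichment, and partitioning follows; the paths, columns, and reference table are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lower;
import static org.apache.spark.sql.functions.to_date;
import static org.apache.spark.sql.functions.trim;

public class TransformAndEnrich {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("transform-enrich").getOrCreate();

    Dataset<Row> raw = spark.read().parquet("s3a://landing/orders/");      // landed data (ELT)
    Dataset<Row> fx  = spark.read().parquet("s3a://reference/fx_rates/");  // reference data

    Dataset<Row> curated = raw
        .dropDuplicates("order_id")                          // cleaning: remove duplicates
        .withColumn("email", lower(trim(col("email"))))      // standardization: consistent format
        .withColumn("order_date", to_date(col("order_ts")))  // standardization: derive a partition column
        .join(fx, "currency")                                // enrichment: attach FX reference data
        .withColumn("amount_usd", col("amount").multiply(col("fx_to_usd")));

    curated.write()
        .mode(SaveMode.Append)
        .partitionBy("order_date")   // partition on a frequently filtered column
        .parquet("s3a://lakehouse/orders_curated/");

    spark.stop();
  }
}
```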

Data Quality and Validation

Implement robust data quality checks to ensure data integrity:

  • Schema validation: Verify data conforms to the expected schema.
  • Data consistency: Check for inconsistencies and anomalies.
  • Data completeness: Identify missing values or records.
  • Data accuracy: Validate data against known reference data.

Albert’s comments

If you use Spark, you can solve this through code (a sketch follows). If you’re not using Spark, you can use an application/tool to solve this problem. Among vendors, onehouse.ai allows data quality and validation during ingest by coding in Java; during my research, I did not come across others who provide a solution for this problem.
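
Here is the kind of code-based quality gate that comment refers to, sketched in Java with Spark: schema enforcement on read, a few completeness and consistency rules, and a quarantine path for rejected rows. The schema, rules, and paths are illustrative assumptions.

```java
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.not;

public class QualityGate {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("quality-gate").getOrCreate();

    // Schema validation: reject malformed files instead of silently nulling fields.
    StructType expected = new StructType()
        .add("order_id", "long").add("email", "string")
        .add("amount", "double").add("order_date", "date");
    Dataset<Row> df = spark.read()
        .schema(expected)
        .option("header", "true")
        .option("mode", "FAILFAST")
        .csv("s3a://landing/orders_csv/");

    // Completeness and consistency rules (illustrative).
    Column bad = col("order_id").isNull()
        .or(col("email").isNull())
        .or(col("amount").lt(0));

    Dataset<Row> invalid = df.filter(bad);
    Dataset<Row> valid = df.filter(not(bad));

    // Quarantine rejected rows for inspection; only clean rows continue downstream.
    invalid.write().mode(SaveMode.Append).parquet("s3a://lakehouse/quarantine/orders/");
    valid.write().mode(SaveMode.Append).parquet("s3a://lakehouse/orders_validated/");

    System.out.println("rejected rows = " + invalid.count());
    spark.stop();
  }
}
```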

Storage Optimization

Optimize storage costs and query performance:

  • Data compression: Reduce storage footprint and improve read performance.
  • Data partitioning: Improve query performance by partitioning data based on frequently used columns.
  • Table services: Enhance ingestion and query performance of any data lakehouse with table cleaning, clustering, compaction, file sizing, and more. For example, when data is written to storage, it is simply written as-is; as a result, you can end up with data skew across files and partitions, and this is where compaction helps.
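
The compression and partitioning bullets boil down to a couple of options on the write itself. A small Java sketch with Spark, using hypothetical paths and a zstd codec (codec availability depends on your Spark build):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class OptimizedWrite {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("optimized-write").getOrCreate();

    Dataset<Row> events = spark.read().parquet("s3a://landing/events/");

    events.write()
        .mode(SaveMode.Append)
        .option("compression", "zstd")   // or "snappy": heavier codecs trade CPU for footprint
        .partitionBy("event_date")       // partition on a frequently filtered column
        .parquet("s3a://lakehouse/events/");

    spark.stop();
  }
}
```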

Albert’s comments

  • For data compression, even though you can change the codec, most people don’t, because heavier compression trades CPU (and therefore read/write speed) for a smaller footprint. See articles like https://siraj-deen.medium.com/the-battle-of-the-compressors-optimizing-spark-workloads-with-zstd-snappy-and-more-for-parquet-82f19f541589 for details about each compression algorithm.
  • For data partitioning, picking the right partitioning scheme is really important. There is a lot to go over, but generally speaking, the idea is to match your partitioning to your queries so each query scans right-sized partitions. See https://www.onehouse.ai/blog/knowing-your-data-partitioning-vices-on-the-data-lakehouse for more details.
  • For table services, this is a must-have. It’s great to be able to write data, but data written to a data lakehouse tends to be unevenly distributed: you end up with data skew across files and partitions, which leaves the environment unoptimized (just like back in the old days when you didn’t defrag your disk drive). If you run table services (compaction, cleaning, file sizing, clustering) like those provided by Apache Hudi and others, you can gain 2x to 10x in performance (a hedged config sketch follows this list). If you want a free tool to observe your data skew, check out onehouse.ai’s LakeView product.
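
To give a flavor of what enabling table services looks like, here is a hedged sketch of a Hudi write with inline compaction, cleaning, and clustering turned on. The option keys below are standard Hudi write configs, but defaults and exact names vary by Hudi version, so treat this as a starting point and check the Hudi docs; the table, fields, and paths are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiTableServices {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-table-services")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate();

    Dataset<Row> changes = spark.read().parquet("s3a://landing/orders/latest/");

    changes.write()
        .format("hudi")
        .option("hoodie.table.name", "orders")
        .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "order_id")
        .option("hoodie.datasource.write.precombine.field", "updated_at")
        // Table services scheduled inline with the write (async scheduling is also possible).
        .option("hoodie.compact.inline", "true")                  // compaction: merge log files into base files
        .option("hoodie.compact.inline.max.delta.commits", "5")
        .option("hoodie.clean.automatic", "true")                 // cleaning: drop superseded file versions
        .option("hoodie.cleaner.commits.retained", "10")
        .option("hoodie.clustering.inline", "true")               // clustering: co-locate and right-size files
        .option("hoodie.clustering.inline.max.commits", "4")
        .mode(SaveMode.Append)
        .save("s3a://lakehouse/orders");

    spark.stop();
  }
}
```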

Metadata Management

Effectively manage metadata for discoverability and governance:

  • Data cataloging: Create a comprehensive data catalog with detailed metadata.
  • Data lineage: Track data transformations and origins.
  • Data governance: Implement policies and procedures for data access and usage.

Albert’s comments

Each open table format requires a data catalog (a deployment may have tens to hundreds to thousands of open table format files across multiple folders). Although there are many options, the only data catalog you can use across Iceberg, Hudi, and Delta Lake is the Apache Hive Metastore (HMS); a sketch of pointing Spark at HMS follows. Each open table format also provides tools to help with metadata management, such as cleaning up old data and its corresponding metadata, or consolidating files and their corresponding metadata.
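
In practice, "using HMS" usually means pointing every engine at the same metastore endpoint. A minimal sketch of configuring Spark against a hypothetical HMS service:

```java
import org.apache.spark.sql.SparkSession;

public class CatalogSetup {
  public static void main(String[] args) {
    // Point Spark at a shared Hive Metastore so all engines agree on table names, schemas, and locations.
    SparkSession spark = SparkSession.builder()
        .appName("catalog-setup")
        .config("hive.metastore.uris", "thrift://hive-metastore:9083")  // hypothetical HMS endpoint
        .config("spark.sql.catalogImplementation", "hive")
        .enableHiveSupport()
        .getOrCreate();

    // Tables registered here become discoverable by other HMS-aware engines.
    spark.sql("CREATE DATABASE IF NOT EXISTS lakehouse");
    spark.sql("SHOW TABLES IN lakehouse").show();
    spark.stop();
  }
}
```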

Monitoring and Optimization

Continuously monitor ingestion performance and optimize as needed:

  • Ingestion metrics: Track ingestion speed, volume, and error rates.
  • Performance tuning: Identify and address bottlenecks.
  • Error handling: Implement robust error handling and recovery mechanisms.
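
If your ingestion runs on Spark Structured Streaming, the query itself exposes per-batch metrics you can poll (a StreamingQueryListener offers a richer callback-based hook). A rough Java sketch, using a built-in rate source as a stand-in for a real ingestion query; the checkpoint path is hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryProgress;

public class IngestionMonitor {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("ingestion-monitor").getOrCreate();

    // Stand-in source so the sketch runs on its own; swap in your real ingestion query.
    Dataset<Row> stream = spark.readStream().format("rate").option("rowsPerSecond", "100").load();

    StreamingQuery query = stream.writeStream()
        .format("console")
        .option("checkpointLocation", "/tmp/chk/ingestion-monitor")
        .start();

    // Ingestion metrics: poll per-batch progress; in production, ship these to your metrics system.
    while (query.isActive()) {
      StreamingQueryProgress p = query.lastProgress();
      if (p != null) {
        System.out.printf("batch=%d rows=%d rate=%.1f rows/s%n",
            p.batchId(), p.numInputRows(), p.processedRowsPerSecond());
      }
      Thread.sleep(10_000);
    }

    // Error handling: surface the failure cause so a scheduler can alert and restart the job.
    if (query.exception().isDefined()) {
      System.err.println("ingestion failed: " + query.exception().get().getMessage());
    }
  }
}
```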

Albert’s comments

As far as monitoring of ingestion metrics and performance goes:

  • Apache Iceberg: You can do this with Tabular, but I’m not sure of its status after the Databricks acquisition.
  • Delta Lake: This can be done via Databricks Lakehouse Monitoring.
  • Apache Hudi: onehouse.ai’s LakeView, which is free with registration (it currently supports Hudi, with plans for Iceberg and Delta Lake in the future).

As far as optimization goes, the main levers are the table services covered above (compaction, clustering, cleaning, file sizing), plus addressing whatever bottlenecks the ingestion metrics reveal.

Security and Compliance

Protect sensitive data with appropriate security measures:

  • Data encryption: Encrypt data at rest and in transit.
  • Access control: Implement role-based access controls.
  • Data masking: Protect sensitive data through anonymization or pseudonymization.
  • Compliance: Adhere to relevant data privacy regulations (e.g., GDPR, CCPA).

By following these best practices, you can build a reliable and efficient data ingestion pipeline that supports your data lakehouse and delivers valuable insights.

Albert’s comments

People do this in different ways.

  • Data encryption: There are usually three types: encryption in memory, in flight, and on disk. You have to implement this as a customization in your pipeline or tooling.
  • Access control: Apache Iceberg has its own security model, Delta Lake uses Apache Ranger, and there is unconfirmed work on Apache Hudi with Apache Ranger.
  • Data masking: You have to implement this as a customization in your pipeline or tooling (a sketch follows this list).
  • Compliance: You would need to work with a vendor for a solution in this space.
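
As a concrete example of the masking customization, here is a hedged Spark sketch in Java that pseudonymizes and redacts sensitive columns while the data is being ingested; the columns, salt, and paths are hypothetical, and a real deployment would manage the salt or key through a secrets manager.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.regexp_replace;
import static org.apache.spark.sql.functions.sha2;

public class MaskOnIngest {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("mask-on-ingest").getOrCreate();

    Dataset<Row> raw = spark.read().parquet("s3a://landing/customers/");

    Dataset<Row> masked = raw
        // Pseudonymization: a salted, stable hash keeps joins working while the raw value never lands.
        .withColumn("email", sha2(concat(col("email"), lit("example-salt")), 256))
        // Redaction: keep only the last four digits of the phone number.
        .withColumn("phone", regexp_replace(col("phone"), "\\d(?=\\d{4})", "*"));

    masked.write().mode(SaveMode.Append).parquet("s3a://lakehouse/customers/");
    spark.stop();
  }
}
```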
