Top Open Source Software for mirroring/offload/CDC data from OLTP to Data Lakehouse or OLAP

3 min read · Mar 20, 2025
mirroring/offload/CDC data from OLTP to Data Lakehouse

Offloading data from an Online Transaction Processing (OLTP) system to a Data Lakehouse or an Online Analytical Processing (OLAP) system is a common practice in data warehousing. Here are several key reasons why this is beneficial:

Improved OLTP Performance:

  • OLTP systems are designed for fast, real-time transaction processing. Complex analytical queries can significantly slow them down, impacting their ability to handle day-to-day operations. Offloading that data to a data lakehouse/OLAP system lets the OLTP system focus on its primary function.

Enhanced Analytical Capabilities:

  • Data lakehouse/OLAP systems are optimized for complex analytical queries, data mining, and reporting. They can efficiently handle large volumes of historical data, enabling businesses to gain valuable insights.
  • Data lakehouse/OLAP systems are designed to handle complex queries that include aggregations, joins, and other data manipulations that are not optimized within an OLTP system.

Historical Data Analysis:

  • Data lakehouse/OLAP systems are designed to store and analyze historical data, which is essential for trend analysis, forecasting, and other business intelligence activities. OLTP systems typically only store recent data.

Costs:

  • Data stored in an OLTP system is much more expensive than in a data lakehouse/OLAP system. OLTP databases like MySQL and PostgreSQL store their data on block storage such as AWS EBS, while data lakes and OLAP systems typically store their data in object stores like AWS S3. As a rough comparison: if it costs $1 to store data in AWS EBS, storing the same data in AWS S3 can cost around $0.10.
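The cost argument above is easy to sanity-check with back-of-the-envelope arithmetic. The prices below are illustrative assumptions based on approximate public list prices (EBS gp3 at roughly $0.08/GB-month, S3 Standard at roughly $0.023/GB-month); actual pricing varies by region, tier, and replication, and S3 infrequent-access or archive tiers widen the gap further.

```python
# Back-of-the-envelope storage cost comparison. Prices are illustrative
# assumptions, not quotes: EBS gp3 ~$0.08/GB-month, S3 Standard ~$0.023/GB-month.
EBS_GP3_PER_GB_MONTH = 0.08
S3_STANDARD_PER_GB_MONTH = 0.023

def monthly_storage_cost(gb: float, price_per_gb_month: float) -> float:
    """Monthly cost in USD for storing `gb` gigabytes at a flat per-GB rate."""
    return gb * price_per_gb_month

data_gb = 10_000  # e.g., 10 TB of historical data offloaded from the OLTP system
ebs = monthly_storage_cost(data_gb, EBS_GP3_PER_GB_MONTH)
s3 = monthly_storage_cost(data_gb, S3_STANDARD_PER_GB_MONTH)
print(f"EBS: ${ebs:,.0f}/mo  S3: ${s3:,.0f}/mo")
```

At these list prices the raw-storage gap is roughly 3–4x rather than 10x, but OLTP replicas, provisioned IOPS, and colder S3 tiers push the real-world difference toward the $1-vs-$0.10 figure quoted above.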

So now that you’ve decided to mirror/offload/CDC your data from your OLTP system to a data lakehouse/OLAP system because it’s cheaper, faster, and better, what are your options for moving the data?

Debezium:

  • Pros: The easiest way to mirror data in real time. Provides connectors for many source databases; each connector monitors the database’s transaction log and publishes change events to Apache Kafka.
  • Cons: Operational management can be intense as you scale.
  • Who provides managed services: Onehouse.ai (as a part of their data lakehouse solution), Confluent (as a part of their Apache Kafka solution)
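To make the Debezium approach concrete, here is a sketch of a connector configuration for a PostgreSQL source, submitted to a Kafka Connect cluster via its REST API. The hostname, database, table, and connector names are hypothetical placeholders; the property keys follow Debezium 2.x conventions (older versions use `database.server.name` instead of `topic.prefix`).

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.dbname": "inventory",
    "topic.prefix": "oltp",
    "table.include.list": "public.orders",
    "plugin.name": "pgoutput"
  }
}
```

Once registered, Debezium streams every committed change to `public.orders` into the Kafka topic `oltp.public.orders`, where a downstream sink can pick it up.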

Apache Kafka:

  • Pros: Very well tested with a large user base; little to no coding needed when paired with Kafka Connect.
  • Cons: Scalability and performance. Limited complex event processing.
  • Who provides managed services: Confluent, Amazon, Azure, Google Cloud, Redpanda
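Whichever tool produces the change events, what flows through Kafka is typically a Debezium-style envelope with `before`/`after` row images and an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read). The sketch below applies such an event to an in-memory mirror; the event payload is hand-written here rather than read from a real topic, and the consumer plumbing is omitted.

```python
import json

# Hand-written example of a Debezium-style change event ("op": "u" = update).
# In production this JSON would arrive as a Kafka message value.
raw = json.dumps({
    "payload": {
        "op": "u",
        "before": {"id": 42, "status": "pending"},
        "after": {"id": 42, "status": "shipped"},
        "source": {"table": "orders"},
    }
})

def apply_change(table: dict, event: dict) -> None:
    """Apply one CDC event to an in-memory mirror keyed by primary key `id`."""
    payload = event["payload"]
    op = payload["op"]
    if op in ("c", "u", "r"):               # create / update / snapshot read: upsert
        row = payload["after"]
        table[row["id"]] = row
    elif op == "d":                         # delete: drop the row
        table.pop(payload["before"]["id"], None)

mirror = {42: {"id": 42, "status": "pending"}}
apply_change(mirror, json.loads(raw))
print(mirror[42]["status"])  # → shipped
```

A real sink (e.g., a lakehouse writer) performs the same upsert/delete logic, just against table files instead of a dict.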

Apache Flink

  • Pros: Scalability and performance.
  • Cons: Requires Java developers to build and maintain complex stream processing jobs; it doesn’t work out of the box (OOTB).
  • Who provides managed services: Confluent, Amazon, Ververica

Apache Spark

  • Pros: Lowest cost for processing large amounts of data.
  • Cons: Requires developers to write the data-processing jobs.
  • Who provides managed services: Onehouse, Amazon, Google Cloud, Databricks, Azure

Finally, let’s talk about the final destination for your data.

Data lakehouse

A data lakehouse aims to serve as the foundational, raw data storage for analytical databases. To achieve this, it should:

  • Support multiple open table formats (e.g., Hudi, Iceberg, Delta Lake), ensuring interoperability between different tools and engines.
  • Leverage object storage (e.g., S3) for cost-effective storage of large datasets.
  • Provide database-like features, such as indexing, to enhance query performance and data management.

Simply storing Parquet files in an object store does not constitute a true data lakehouse. The core distinguishing factor is the addition of a metadata layer and transactional capabilities provided by open table formats.
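To illustrate what that metadata layer buys you, here is a deliberately simplified, format-agnostic sketch of a commit log in pure Python. This is not Hudi, Iceberg, or Delta Lake code; it only mimics their shared core idea: each commit atomically publishes an immutable snapshot listing exactly which data files make up the table, so readers never see a half-written set of Parquet files.

```python
import json
import os
import tempfile

def commit(table_dir: str, files: list[str]) -> int:
    """Publish a new snapshot listing the table's current data files."""
    versions = [int(f.split(".")[0]) for f in os.listdir(table_dir)] or [-1]
    version = max(versions) + 1
    tmp = os.path.join(table_dir, f"{version}.json.tmp")
    final = os.path.join(table_dir, f"{version}.json")
    with open(tmp, "w") as f:
        json.dump({"version": version, "files": files}, f)
    os.rename(tmp, final)   # atomic publish: readers see the old or new snapshot
    return version

def latest_files(table_dir: str) -> list[str]:
    """Resolve the current table contents by reading the newest snapshot."""
    version = max(int(f.split(".")[0]) for f in os.listdir(table_dir)
                  if f.endswith(".json"))
    with open(os.path.join(table_dir, f"{version}.json")) as f:
        return json.load(f)["files"]

log = tempfile.mkdtemp()
commit(log, ["part-0.parquet"])
commit(log, ["part-0.parquet", "part-1.parquet"])  # new snapshot; old one is kept
print(latest_files(log))  # → ['part-0.parquet', 'part-1.parquet']
```

Real table formats layer schemas, column statistics, partition manifests, and concurrency control on top of this, but the snapshot log is the piece that turns a pile of Parquet files into a transactional table.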

OLAP

  • Pros: A vertically integrated solution.
  • Cons: Cost, and dependence on a single vendor/project.

Overall, if you have a lot of data, an Apache Spark-based solution is the best choice due to cost. If you want the easiest path, Debezium is your best option due to its purpose-built nature for OLTP-to-data-lakehouse/OLAP offloading.

Written by Albert Wong