2023 Modern Open Source Data Stack for Data Warehouse and Data Lakehouse
Nov 16, 2023
I get this question a lot from the community that I manage: what would I build and use if I were putting together a modern data stack for a data warehouse or data lakehouse?
Requirements:
- An OLAP database that can meet sub-second query times and single-digit-second ingestion times. It should also run JOINs at scale and stay performant, so I can reduce the amount of data engineering (no need for denormalization, fewer transformations).
- Use Open Source or Commercial Open Source where we can.
- Be cost optimal (using OLAP separation of compute and storage).
- Cloud Native.
- Be able to run on premises, self-managed, or through a managed service.
- Run on open standards.
Here’s what I would use and why:
- OLAP: StarRocks. StarRocks was designed to address the challenges of real-time analytics, including the need to support high concurrency, low latency, and a wide range of analytical workloads. It also offers the ability to query data directly from data lakes. StarRocks received InfoWorld’s 2023 BOSSIE Award for best open source software.
- Open Table Format: Apache Iceberg (http://tabular.io) or Apache Hudi (http://onehouse.ai). Both are leading open table formats and provide features beyond Parquet, like indexing and ACID transactions. For more about the differences between the formats, see https://atwong.medium.com/my-favorite-articles-to-understand-the-differences-between-open-table-formats-apache-hudi-apache-de0bd760eead. Most people choose one open table format, but why not both? One unique ability of StarRocks is that you can have Apache Iceberg data side by side with Apache Hudi data. This allows you not only to query both formats but also to create things like materialized views across both formats.
- Storage format: Parquet. Better than CSV and similar formats since the schema is part of the file itself and the data is compressed. It is also columnar, so analytical queries can read only the columns they need.
- Storage: S3 compatible. Many cloud services provide an object store, and you can also use Ceph or MinIO.
- Business Intelligence and Data Visualization: Apache Superset (http://preset.io). It’s like an open source version of the most used features in Tableau.
- Streaming technology: Apache Kafka. An easy way to do 1-to-many, many-to-many, or many-to-1 delivery when you source or sink your data. You can capture real-time data from your environment and sink it into StarRocks.
- Batch Scheduling: Airbyte. To me, it’s like the open source version of Informatica ETL.
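
To make the Iceberg-next-to-Hudi idea a bit more concrete, here is a minimal sketch of the kind of SQL I have in mind, generated from a small Python script. Treat it as an illustration under assumptions, not a tested setup: the catalog names (`iceberg_cat`, `hudi_cat`), the databases, tables, and columns, and the metastore URI are all made up, and I am assuming both catalogs sit behind a Hive metastore. The property keys and the materialized view clauses should be checked against the StarRocks documentation for your version.

```python
# Sketch only: every catalog, database, table, and column name here is
# hypothetical, and the metastore URI is a placeholder.

METASTORE_URI = "thrift://metastore.example.internal:9083"  # placeholder


def create_catalog_sql(name: str, table_format: str, metastore_uri: str) -> str:
    """Build a CREATE EXTERNAL CATALOG statement for a Hive-metastore-backed catalog.

    The property keys are my assumption of what StarRocks expects; verify them
    for your version.
    """
    if table_format == "iceberg":
        metastore_prop = '"iceberg.catalog.type" = "hive"'
    elif table_format == "hudi":
        metastore_prop = '"hive.metastore.type" = "hive"'
    else:
        raise ValueError(f"unsupported table format: {table_format}")
    return (
        f"CREATE EXTERNAL CATALOG {name}\n"
        "PROPERTIES (\n"
        f'  "type" = "{table_format}",\n'
        f"  {metastore_prop},\n"
        f'  "hive.metastore.uris" = "{metastore_uri}"\n'
        ");"
    )


# One query that joins an Iceberg table with a Hudi table, using
# catalog.database.table names to reach across both catalogs.
CROSS_FORMAT_QUERY = (
    "SELECT o.order_id, o.amount, c.region\n"
    "FROM iceberg_cat.sales.orders AS o\n"
    "JOIN hudi_cat.crm.customers AS c\n"
    "  ON o.customer_id = c.customer_id"
)


def create_mv_sql(name: str, query: str) -> str:
    """Wrap a query in an asynchronously refreshed materialized view definition."""
    return f"CREATE MATERIALIZED VIEW {name}\nREFRESH ASYNC\nAS\n{query};"


def build_statements() -> list[str]:
    """Return the statements in the order they would need to run."""
    return [
        create_catalog_sql("iceberg_cat", "iceberg", METASTORE_URI),
        create_catalog_sql("hudi_cat", "hudi", METASTORE_URI),
        create_mv_sql("orders_by_region_mv", CROSS_FORMAT_QUERY),
    ]


if __name__ == "__main__":
    for statement in build_statements():
        print(statement, end="\n\n")
```

The script only prints the statements. Since StarRocks is compatible with the MySQL protocol, you could send statements like these with any MySQL client or driver once you have adapted the names to your own environment.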