2023 Modern Open Source Data Stack for Data Warehouse and Data Lakehouse

Albert Wong
3 min read · Nov 16, 2023

I get this question a lot from the community that I manage: what would I build and use if I were putting together a modern data stack for a data warehouse or data lakehouse?

Requirements:

  • An OLAP database that delivers sub-second query times and single-digit-second ingestion times. It should also run JOINs at scale with good performance, so I can reduce the amount of data engineering (no need for denormalization, fewer transformations).
  • Use Open Source or Commercial Open Source where we can.
  • Be cost-optimal (using an OLAP database that separates compute and storage).
  • Cloud Native.
  • Be able to run on premises, self-managed, or through a managed service.
  • Run on open standards.

Here’s what I would use and why:

  • OLAP: StarRocks. StarRocks was designed to address the challenges of real-time analytics, including the need to support high concurrency, low latency, and a wide range of analytical workloads. It can also query data directly from data lakes. StarRocks received InfoWorld’s 2023 BOSSIE Award for best open source software.
  • Open Table Format: Apache Iceberg (http://tabular.io) or Apache Hudi (http://onehouse.ai). Both are leading open table formats, and both add features on top of Parquet such as indexing and ACID transactions. For more on the differences between the formats, see https://atwong.medium.com/my-favorite-articles-to-understand-the-differences-between-open-table-formats-apache-hudi-apache-de0bd760eead. Most people choose one open table format, but why not both? One unique ability of StarRocks is that you can have Apache Iceberg data side by side with Apache Hudi data. This lets you not only query both formats but also create things like materialized views across both formats.
  • Storage format: Parquet. It is better than CSV and similar formats because the schema is part of the file itself and the data is compressed, among many other advantages.
  • Storage: S3-compatible object storage. Many cloud services provide an object store, and you can also use Ceph or MinIO.
  • Business Intelligence and Data Visualization: Apache Superset (http://preset.io). It’s like an open source version of the most used features in Tableau.
  • Streaming technology: Apache Kafka. It is an easy way to do 1-to-many, many-to-many, and many-to-1 routing between your data sources and sinks. You can then take real-time data from your environment and sink it into StarRocks.
  • Batch Scheduling: Airbyte. To me, it’s like the open source version of Informatica ETL.
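
To make the "why not both?" point about table formats concrete, here is a minimal sketch in Python that renders the StarRocks SQL for registering one Iceberg catalog and one Hudi catalog, plus a query that joins across them. It assumes both lakes are registered in a Hive Metastore; the catalog names, metastore URI, databases, tables, and columns are all hypothetical, and the exact catalog properties depend on your StarRocks version and metastore type, so treat it as an illustration rather than a reference configuration.

```python
# Sketch only: renders StarRocks SQL as strings; it does not connect to anything.
# Every catalog, database, table, column, and URI below is a made-up example.


def external_catalog_sql(name: str, props: dict[str, str]) -> str:
    """Render a CREATE EXTERNAL CATALOG statement from a property map."""
    body = ",\n".join(f'    "{key}" = "{value}"' for key, value in props.items())
    return f"CREATE EXTERNAL CATALOG {name}\nPROPERTIES (\n{body}\n);"


# Assumed Hive Metastore endpoint shared by both lakes (hypothetical).
METASTORE_URI = "thrift://metastore.example.internal:9083"

iceberg_sql = external_catalog_sql("iceberg_lake", {
    "type": "iceberg",
    "iceberg.catalog.type": "hive",
    "hive.metastore.uris": METASTORE_URI,
})

hudi_sql = external_catalog_sql("hudi_lake", {
    "type": "hudi",
    "hive.metastore.type": "hive",
    "hive.metastore.uris": METASTORE_URI,
})

# One query that reads an Iceberg table and a Hudi table side by side,
# addressing each as catalog.database.table.
join_sql = """SELECT o.order_id, c.customer_name
FROM iceberg_lake.sales.orders AS o
JOIN hudi_lake.crm.customers AS c
  ON o.customer_id = c.customer_id;"""

if __name__ == "__main__":
    for statement in (iceberg_sql, hudi_sql, join_sql):
        print(statement, end="\n\n")
```

Because StarRocks speaks the MySQL wire protocol, statements like these could be run from any MySQL-compatible client once they are adapted to a real environment.
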
Query data on top of the lake, support performant JOINs at scale, and support thousands of users running ad hoc queries.
Run StarRocks on top of raw data and then create views or materialized views as needed.
Airbnb with StarRocks: 4 JOINs across billions of rows in under 4 seconds.
Tencent Games with StarRocks: 400+ users doing ad hoc queries on xx+ petabytes of data on Apache Iceberg files.
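
As a rough illustration of "create materialized views as needed", the Python sketch below renders an asynchronous StarRocks materialized view defined over a lake table. The view name, the `iceberg_lake.sales.orders` table, its columns, the hash key, and the 30-minute refresh interval are all assumptions made up for the example, and which clauses are required (for instance `DISTRIBUTED BY`) varies between StarRocks versions.

```python
# Sketch only: builds the SQL text for an async materialized view.
# All object names and the refresh interval are hypothetical.


def async_mv_sql(mv_name: str, hash_column: str,
                 refresh_minutes: int, select_sql: str) -> str:
    """Render CREATE MATERIALIZED VIEW with a periodic async refresh."""
    query = select_sql.strip().rstrip(";")  # avoid a doubled semicolon
    return (
        f"CREATE MATERIALIZED VIEW {mv_name}\n"
        f"DISTRIBUTED BY HASH({hash_column})\n"
        f"REFRESH ASYNC EVERY (INTERVAL {refresh_minutes} MINUTE)\n"
        f"AS\n{query};"
    )


# Aggregate raw lake data once, so dashboards hit the precomputed result.
mv_sql = async_mv_sql(
    mv_name="lake_sales_by_region",
    hash_column="region",
    refresh_minutes=30,
    select_sql="""
        SELECT region, SUM(amount) AS total_amount
        FROM iceberg_lake.sales.orders
        GROUP BY region;
    """,
)

if __name__ == "__main__":
    print(mv_sql)
```

The design idea is the one described above: leave the raw data in the lake, query it directly, and add a materialized view only where a recurring query turns out to need it.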
