2023 Modern Open Source Data Stack for Data Warehouse and Data Lakehouse

Albert Wong
Nov 16, 2023

I get this question a lot from the community I manage: what would I build and use if I were putting together a modern data stack for a data warehouse or data lakehouse?

Requirements:

  • An OLAP database that delivers sub-second query times and single-digit-second ingestion times. It should also run JOINs at scale and stay performant, so I can cut down on data engineering (no need for denormalization, and fewer transformations).
  • Use Open Source or Commercial Open Source where we can.
  • Be cost-efficient (using an OLAP engine that separates compute from storage).
  • Be cloud native.
  • Be able to run on premises, self-managed, or through a managed service.
  • Run on open standards.

Here’s what I would use and why:

  • OLAP: StarRocks. StarRocks was designed to address the challenges of real-time analytics, including the need to support high concurrency, low latency, and a wide range of analytical workloads. It can also query data directly from data lakes. StarRocks received InfoWorld’s 2023 BOSSIE Award for best open source software.
  • Open Table Format: Apache Iceberg (http://tabular.io) or Apache Hudi (http://onehouse.ai). These are the leading open table formats, and both provide features beyond plain Parquet, such as indexing and ACID transactions. For more on the differences between the formats, see https://atwong.medium.com/my-favorite-articles-to-understand-the-differences-between-open-table-formats-apache-hudi-apache-de0bd760eead. Most people choose one open table format, but why not both? One unique ability of StarRocks is that you can have Apache Iceberg data side by side with Apache Hudi data. This lets you not only query both formats but also create things like materialized views across both formats.
  • Storage format: Parquet. It beats CSV and similar formats because the schema is part of the file itself and the data is compressed, among many other advantages.
  • Storage: S3-compatible object storage. Many cloud services provide an object store, and you can also use Ceph or MinIO.
  • Business Intelligence and Data Visualization: Apache Superset (http://preset.io). It’s like an open source version of the most used features in Tableau.
  • Streaming technology: Apache Kafka. It’s an easy way to connect data sources and sinks one-to-many, many-to-many, or many-to-one. You get real-time data from your environment and sink it into StarRocks.
  • Batch Scheduling: Airbyte. To me, it’s like the open source version of Informatica ETL.
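
To make the Iceberg-next-to-Hudi idea above a bit more concrete, here is a minimal sketch in Python that only builds the SQL text you might send to StarRocks; it does not connect to anything. It assumes both table formats are registered through a Hive metastore. The catalog names (iceberg_cat, hudi_cat), the metastore URI, and the sales.orders / sales.customers tables and their columns are all hypothetical, so treat this as an illustration and check the StarRocks documentation for the catalog properties your version expects.

```python
# Sketch only: builds StarRocks SQL as strings; nothing is sent to a server.
# Every catalog, database, table, and column name and the metastore URI
# below is hypothetical and chosen purely for illustration.

# Assumed catalog properties for Hive-metastore-backed external catalogs.
CATALOG_PROPERTIES = {
    "iceberg": {"type": "iceberg", "iceberg.catalog.type": "hive"},
    "hudi": {"type": "hudi", "hive.metastore.type": "hive"},
}


def external_catalog_sql(name, table_format, metastore_uri):
    """Return a CREATE EXTERNAL CATALOG statement for the given table format."""
    if table_format not in CATALOG_PROPERTIES:
        raise ValueError(f"unsupported table format: {table_format}")
    props = dict(CATALOG_PROPERTIES[table_format])
    props["hive.metastore.uris"] = metastore_uri
    lines = ",\n".join(f'    "{key}" = "{value}"' for key, value in props.items())
    return f"CREATE EXTERNAL CATALOG {name}\nPROPERTIES (\n{lines}\n);"


def cross_format_join_sql(iceberg_table, hudi_table):
    """Return a query joining an Iceberg table with a Hudi table.

    Tables are addressed with three-part names: catalog.database.table.
    """
    return (
        "SELECT c.region, SUM(o.amount) AS revenue\n"
        f"FROM {iceberg_table} AS o\n"
        f"JOIN {hudi_table} AS c ON o.customer_id = c.customer_id\n"
        "GROUP BY c.region;"
    )


statements = [
    external_catalog_sql("iceberg_cat", "iceberg", "thrift://metastore:9083"),
    external_catalog_sql("hudi_cat", "hudi", "thrift://metastore:9083"),
    cross_format_join_sql("iceberg_cat.sales.orders", "hudi_cat.sales.customers"),
]

if __name__ == "__main__":
    print("\n\n".join(statements))
```

Since StarRocks speaks the MySQL wire protocol, statements like these could be sent with any MySQL-compatible client. A query of this shape, spanning both catalogs, is also the kind of SELECT a materialized view across both formats would be defined on.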
