State of the Data Lakehouse in 2023: Different types of Data Lakehouses (Hybrid, Open Storage, Open)
So it seems like Data Lakehouses are being slotted into three different types.
Hybrid: This makes sense. The Snowflake, Amazon Redshift, Google BigQuery, and Microsoft Azure Synapse Analytics of the world are adding support for open table formats like Apache Iceberg, first for reads and later for writes. I wouldn’t be surprised if some of them also support Apache Hudi in the near future.
Open Storage: I think Databricks Photon is the eventual successor to Apache Spark. What is interesting is that Photon isn’t open source the way Spark is. I believe their strategy is to open source the layers below and then focus on the compute engine. Another interesting thing about Delta Lake is that there seem to be two different versions: one for paying customers (Databricks Delta) and another for open source users (open source Delta Lake).
Open: This is where all the tiers are open source, with the option to buy commercial versions. Trino, a distributed query engine, can already read, and in most cases write, popular open table formats like Apache Iceberg and Apache Hudi, as well as Apache Hive tables. StarRocks, an open source query engine that delivers data warehouse performance on the data lake, can also read, and in some cases write, Apache Iceberg, Apache Hudi, Delta Lake, and Apache Hive tables.
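To make the three buckets a bit more concrete, here is a minimal Python sketch of how I read the taxonomy. The `Stack` class, its two flags, and the `classify` function are my own illustrative names and not anything the vendors define, and reducing each product to two booleans is a deliberate simplification.

```python
from dataclasses import dataclass
from enum import Enum


class LakehouseType(Enum):
    HYBRID = "Hybrid"
    OPEN_STORAGE = "Open Storage"
    OPEN = "Open"


@dataclass(frozen=True)
class Stack:
    """Hypothetical, simplified description of a lakehouse stack."""
    name: str
    # True if the product is built around its own closed warehouse storage
    # and adds open table formats on the side.
    proprietary_native_storage: bool
    # True if the compute/query engine itself is open source.
    open_source_engine: bool


def classify(stack: Stack) -> LakehouseType:
    """Map a stack onto the three types described above (my reading only)."""
    if stack.open_source_engine:
        # Every tier is open source; commercial versions are optional.
        return LakehouseType.OPEN
    if stack.proprietary_native_storage:
        # A warehouse that is adding open table format support.
        return LakehouseType.HYBRID
    # Open storage layers underneath a closed compute engine.
    return LakehouseType.OPEN_STORAGE


examples = [
    Stack("Snowflake", proprietary_native_storage=True, open_source_engine=False),
    Stack("Databricks Photon on Delta Lake", proprietary_native_storage=False, open_source_engine=False),
    Stack("Trino", proprietary_native_storage=False, open_source_engine=True),
]

for s in examples:
    print(f"{s.name}: {classify(s).value}")
```

The flag values for each example simply restate the descriptions in this post; real products are messier than two booleans, so treat this as a way to think about the categories, not as a definitive classification.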
It’ll be interesting to see how the landscape will change over time.