Open Source and Closed Source OLAP databases that can run TPC-H and TPC-DS benchmarks
Before we start, what is TPC-H and TPC-DS and why are they important?
TPC-H and TPC-DS are important because they are industry standard benchmarks for measuring the performance of data warehouse and big data systems. They are widely used by vendors and customers to evaluate the performance of different systems and to compare the performance of the same system over time.
TPC-H is a benchmark for decision support systems that analyze large volumes of data. It is based on a star schema model of a sales database and includes 22 queries that represent common business intelligence workloads.
TPC-DS is a benchmark for big data systems that analyze large volumes of data with complex queries. It is based on a retail data warehouse model and includes 99 queries that represent common big data workloads.
TPC-H and TPC-DS are important for the following reasons:
- They provide a common basis for comparison: TPC-H and TPC-DS allow vendors and customers to compare the performance of different systems on a level playing field. This makes it easier to choose the right system for a particular workload.
- They help to identify performance bottlenecks: TPC-H and TPC-DS can be used to identify performance bottlenecks in data warehouse and big data systems. This information can be used to improve the performance of the systems.
- They promote innovation: TPC-H and TPC-DS help to promote innovation in the data warehouse and big data market. Vendors are constantly working to improve the performance of their systems in order to achieve better TPC-H and TPC-DS results.
First, what is the difference between TPC-H and TPC-DS? TPC-DS has more difficult SQL-like SQL queries with different types of JOINS compared to TPC-H. Most OLAP systems can’t even complete the TPC-H or the more difficult TPC-DS benchmark. Read more at https://atwong.medium.com/what-is-the-difference-between-tpc-h-and-tpc-ds-benchmarks-cb92fc104c32
Second, if you can’t complete and run the TPC-H benchmark, it’s highly likely you can’t run the TPC-DS benchmark. Some of the common reasons you can’t run TPC-DS is because you don’t support all the different types of JOINS (ex: Apache Druid) or use a SQL-like syntax that isn’t SQL ANSI compliant (ex: Clickhouse).
Third, since many OLAP databases can’t run the TPC-H and TPC-DS benchmark, they usually benchmark against a highly modified version of the Star Schema Benchmark (SSB). Some of the modifications are rewriting the query (don’t have SQL X so rewrite it to product-specific query Y) or flatting the schema by using data denormalization so that the query can run (since they don’t support JOINS). This is how they say they’re “faster” by changing the rules or the data structure.
With that said, here’s a short list of databases
- StarRocks (open source): Successful. See https://www.starrocks.io/blog/benchmark-test
- ClickHouse (open source): Failed. They do not strictly follow the standard SQL syntax. They say to use their modified version of Star Schema Benchmark (SSB) which isn’t an apples-to-apples comparison. https://clickhouse.com/docs/en/getting-started/example-datasets/star-schema
- Apache Druid (open source): Failed. They said: Since Druid was built to solve a specific type of problem, we chose a set of benchmarks typical of Druid’s workload that covers the majority of queries we observe in production. Why not use the TPC-H benchmark queries, you may ask? Most of those queries do not directly apply to Druid, and we would have to largely modify the queries or the data to fit the Druid model. See https://druid.apache.org/blog/2014/03/17/benchmarking-druid.html.
- SingleStore (close source): Successful. https://www.singlestore.com/blog/memsql-tpc-benchmarks/
See more about database comparisons at https://github.com/alberttwong/databasecomparison