What is TPC-DS and why should you care?
TL-DR; It’s a schema of 24 tables built for a retail scenario. There are 99 complex SQL queries that execute:
- Sequential scans of large amounts of data
- Aggregation of large amounts of data
- Multi-table joins
- Possibly extensive sorting
To try to answer questions like:
Query #57: Find the item brands and categories for each call center and their monthly sales figures for a specified year, where the monthly sales figure deviated more than 10% of the average monthly sales for the year, sorted by deviation and call center. Report the sales deviation from the previous and following month.
Longer answer
I need a sample data set and queries that represent a real life life scenario. I could build something from scratch or export some of the production data or use a public reference. Here is where TPC comes into play. TPC is the Transaction Processing Performance Council and they created a tool to generate data and they provide a bunch of real life business queries that you can run against that generated data. This is a good way to benchmark databases against each other. The latest schema and queries can be found at https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPC-DS_v3.2.0.pdf.
For instance, StarRocks publishes a benchmark that they are 5x faster than Trino (see benchmark report at https://www.starrocks.io/blog/benchmark-test). They also have benchmarks against other databases like Apache Druid and Clickhouse.