Easiest way to load TPC-DS data into PostgreSQL

Albert Wong
2 min readOct 6, 2023

--

At StarRocks/Celerdata, we do a lot of testing. One the tests I had to run was to do a CDC scenario where I’m loading in TPC-DS data and then having it available in StarRocks. Here are the steps that I used.

  1. Use duckDB to generate the TPC-DS data set

Install duckDB and then run the following commands in the duckdb shell.

INSTALL tpcds;
LOAD tpcds;
SELECT * FROM dsdgen(sf=1);
EXPORT DATABASE 'TARGET_DIR' (FORMAT CSV, DELIMITER '|');
INSTALL tpcds;
LOAD tpcds;
SELECT * FROM dsdgen(sf=1);
EXPORT DATABASE '/Users/albert/sandbox/data/' (FORMAT CSV, DELIMITER '|');

This will generate 24 tables and 2 scripts (schema.sql and load.sql)

Note: Generating TPC-DS data is painful. TPC.org scripts are not user friendly. Using duckDB is the easiest way by FAR.

Another Note: If you’re going to load data for another OLAP database, I would export to parquet format. Since we are loading into a OLTP database, I used CSV.

2. Use DBeaver to create the tables and load the data

Open DBeaver and create a connection to the postgres instance. Launch a new SQL window and execute schema.sql to create all the tables.

Once the tables have all been created, you can use the import feature in DBeaver to load the data. Make sure you change the default settings for

  • Column delimiter to “|”
  • Header position to “none”
  • Quote char is already set to “

I also used the “bulk load” setting. It seemed to make the import process faster.

3. Setting up primary key for the imported tables.

Next, you have to setup primary keys for the tables you’ve imported. You do this by looking at the schema design at https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.6.0.pdf. It shows you which column(s) are primary keys.

One of the tables I was using (catalog_sales) has a compound primary key.

Note: I did not bother to implement foreign keys (FP).

4. Ready for use

Once that has been done, postgresql is now ready to be a source for Apache Kafka and Debezium.

Full architecture is: PostgreSQL (with TPC-DS data) -> Onehouse.ai (runs Apache Kafka in the backend to transform data from PostgreSQL to Apache Hudi on S3) -> StarRocks / CelerData -> Apache SuperSet / Preset.io.

--

--

Albert Wong
Albert Wong

Written by Albert Wong

#eCommerce #Java #Database #k8s #Automation. Hobbies: #BoardGames #Comics #Skeet #VideoGames #Pinball #Magic #YelpElite #Travel #Candy

No responses yet