Easiest way to load TPC-DS data into StarRocks
At StarRocks/Celerdata, we do a lot of testing. One the tests I had to run was TPC-DS benchmarking in StarRocks. Here are the steps that I used.
- Use duckDB to generate the TPC-DS data set
Install duckDB and then run the following commands in the duckdb shell.
INSTALL tpcds;
LOAD tpcds;
SELECT * FROM dsdgen(sf=1);
EXPORT DATABASE ‘TARGET_DIR’ (FORMAT PARQUET);
This will generate 24 tables and 2 scripts (schema.sql and load.sql)
Note: Generating TPC-DS data is painful. TPC.org scripts are not user friendly. Using duckDB is the easiest way by FAR.
Tip: If you’re going to load data for an OLAP database, I would export to parquet format. If you are going to load data for an OLTP database, use CSV.
2. Load the parquet files into StarRocks
Using https://docs.starrocks.io/en-us/latest/sql-reference/sql-functions/table-functions/files as a reference, run the follow SQL command.
INSERT INTO item SELECT * FROM FILES( “path” = “s3://item.parquet”, “format” = “parquet”, “aws.s3.access_key” = “XXXXXXXXXX”, “aws.s3.secret_key” = “YYYYYYYYYY”, “aws.s3.region” = “us-west-2”