StarRocks query performance with Apache Hudi and Onehouse

Albert Wong
2 min readOct 11, 2023

--

Apache Hudi and Onehouse + StarRocks

To prepare for an upcoming demo, I created an environment where I generated 1GB of TPC-DS data, put it in PostgreSQL and then have Onehouse.ai consume the data in real time and create Apache Hudi tables. See https://atwong.medium.com/easiest-way-to-load-tpc-ds-data-into-postgresql-1ebd83871a07 for more details.

Once the data is in Apache Hudi, I used StarRocks to connect to Apache Hudi via the Apache Hudi external catalog feature and then I ran TPC-DS query 57.

Here is what TPC-DS query 57 looks like

with v1 as(
select i_category, i_brand, cc_name, d_year, d_moy,
sum(cs_sales_price) sum_sales,
avg(sum(cs_sales_price)) over
(partition by i_category, i_brand,
cc_name, d_year)
avg_monthly_sales,
rank() over
(partition by i_category, i_brand,
cc_name
order by d_year, d_moy) rn
from public_item_ro, public_catalog_sales_ro, public_date_dim_ro, public_call_center_ro
where cs_item_sk = i_item_sk and
cs_sold_date_sk = d_date_sk and
cc_call_center_sk= cs_call_center_sk and
(
d_year = 1999 or
( d_year = 1999-1 and d_moy =12) or
( d_year = 1999+1 and d_moy =1)
)
group by i_category, i_brand,
cc_name , d_year, d_moy),
v2 as(
select v1.i_category ,v1.d_year, v1.d_moy ,v1.avg_monthly_sales
,v1.sum_sales, v1_lag.sum_sales psum, v1_lead.sum_sales nsum
from v1, v1 v1_lag, v1 v1_lead
where v1.i_category = v1_lag.i_category and
v1.i_category = v1_lead.i_category and
v1.i_brand = v1_lag.i_brand and
v1.i_brand = v1_lead.i_brand and
v1.cc_name = v1_lag.cc_name and
v1.cc_name = v1_lead.cc_name and
v1.rn = v1_lag.rn + 1 and
v1.rn = v1_lead.rn - 1)
select *
from v2
where d_year = 1999 and
avg_monthly_sales > 0 and
case when avg_monthly_sales > 0 then abs(sum_sales - avg_monthly_sales) / avg_monthly_sales else null end > 0.1
order by sum_sales - avg_monthly_sales, 3
limit 100;

The raw results can be found at https://github.com/StarRocks/starrocks/discussions/32531

Note: This is for only one TPC-DS query. There are 100 of them so it doesn’t mean that every single one of those queries will be faster/slower.

The summary was that the Apache Hudi RO and RT tables ran the queries cold in ~7 seconds. Executing the queries again, StarRocks ran the queries in ~3 seconds.

--

--

Albert Wong
Albert Wong

Written by Albert Wong

#eCommerce #Java #Database #k8s #Automation. Hobbies: #BoardGames #Comics #Skeet #VideoGames #Pinball #Magic #YelpElite #Travel #Candy

No responses yet