The hottest area of database design: Querying billions of rows per second with SIMD

Albert Wong
2 min read · Jul 18, 2023


As a database engineer, I get asked all the time, “What is the hottest area of database design right now?” My answer is SIMD, which stands for single instruction, multiple data: one CPU instruction operates on several data elements at once, which lets a database process a lot of data very, very fast.

Scalar vs SIMD Operation

Josh Weinstein said it best (full article linked below).

SIMD. Single instruction multi data. You may not have heard of these four words before, but they have the power to make software run at lightning speed. They can accelerate actions like copying or searching data 10x, 20x or more times faster than with traditionally written code. The CPUs that power our computers today possess a special set of instructions that can process data simultaneously, and in parallel. In fact, the sets of these instructions have been around for a number of years. They are seldom explored or discussed, but have the potential to provide unparalleled performance in a world of ever growing software capacity.
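To make the scalar vs. SIMD difference concrete, here is a minimal sketch in C. It is not taken from any of the databases discussed in this post. It assumes a GCC or Clang compiler, because it relies on their `vector_size` extension rather than standard C, and the names `v4i32`, `add_scalar`, and `add_simd` are made up for illustration. The scalar version adds one pair of integers per loop iteration, while the SIMD-style version adds four pairs per iteration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Four 32-bit integers packed into one 128-bit vector.
   Assumption: GCC/Clang vector extension, not standard C. */
typedef int32_t v4i32 __attribute__((vector_size(16)));

/* Scalar: one addition per loop iteration. */
void add_scalar(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* SIMD-style: four additions per loop iteration. */
void add_simd(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        v4i32 va, vb, vr;
        memcpy(&va, a + i, sizeof va);   /* load 4 values, alignment-safe */
        memcpy(&vb, b + i, sizeof vb);
        vr = va + vb;                    /* element-wise add across all 4 lanes */
        memcpy(out + i, &vr, sizeof vr); /* store 4 results */
    }
    for (; i < n; i++) {                 /* scalar tail for leftover elements */
        out[i] = a[i] + b[i];
    }
}
```

On CPUs with 128-bit SIMD registers, a compiler will typically turn the `va + vb` line into a single vector add instruction; on other targets it falls back to equivalent scalar code, so both functions always return the same results. Production engines usually go further with wider registers and hand-tuned intrinsics, but the idea is the same.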

The idea of using SIMD has been around for a while. Academic papers such as http://www.cs.columbia.edu/~kar/pubsk/simd.pdf and https://15721.courses.cs.cmu.edu/spring2016/papers/p1493-polychroniou.pdf explored its potential as early as the 2000s, but only recently have database developers started building it into database products.

As of right now, I’m aware of only four databases that use SIMD at the core of their query layer: StarRocks (OLAP), Apache Druid (OLAP), ClickHouse (OLAP), and QuestDB (time-series). All of them are fast; however, among the OLAP databases I mentioned, only StarRocks performs JOINs efficiently at scale. Read more about how StarRocks implements this at https://docs.starrocks.io/en-us/2.5/introduction/Features

Performance difference of various OLAP databases
Graphic of JOIN performance on the TPC-DS test data, comparing StarRocks and Trino (I would expect similar results with AWS Athena and PrestoDB)
Graphic of SSB flat-table benchmarks across the SIMD databases StarRocks, ClickHouse, and Apache Druid. Note: ClickHouse and Druid only partially support JOINs, so we compared denormalized tables.

More info at https://github.com/alberttwong/databasecomparison
