The hottest area of database design: Querying billions of rows per second with SIMD
As a database engineer, I get asked all the time, “What is the hottest area of database design right now?” My answer is SIMD, which stands for single instruction, multiple data: one CPU instruction operates on several values at once, so a lot of data gets processed very, very fast.
Josh Weinstein said it best (full article linked below).
SIMD. Single instruction multi data. You may not have heard of these four words before, but they have the power to make software run at lightning speed. They can accelerate actions like copying or searching data 10x, 20x or more times faster than with traditionally written code. The CPUs that power our computers today possess a special set of instructions that can process data simultaneously, and in parallel. In fact, the sets of these instructions have been around for a number of years. They are seldom explored or discussed, but have the potential to provide unparalleled performance in a world of ever growing software capacity.
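To make that concrete, here is a minimal sketch in C of the kind of work a query engine does when it evaluates a filter like "value > threshold" over a column of integers. It is an illustration only, not code from any database or article mentioned in this post. The function names count_greater_scalar and count_greater_simd are made up for the example, and the SIMD path assumes an x86 CPU with SSE2 (on other targets it simply falls back to the plain loop).

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics (x86 only) */
#endif

/* Scalar baseline: one comparison per loop iteration. */
size_t count_greater_scalar(const int32_t *col, size_t n, int32_t threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (col[i] > threshold) {
            count++;
        }
    }
    return count;
}

/* SIMD version: on SSE2, each compare instruction checks four 32-bit
 * values at once. Hypothetical helper written for this post. */
size_t count_greater_simd(const int32_t *col, size_t n, int32_t threshold) {
    size_t count = 0;
    size_t i = 0;
#if defined(__SSE2__)
    /* Broadcast the threshold into all four lanes of a 128-bit register. */
    const __m128i limit = _mm_set1_epi32(threshold);
    for (; i + 4 <= n; i += 4) {
        /* Load four column values (unaligned load is fine here). */
        __m128i values = _mm_loadu_si128((const __m128i *)(col + i));
        /* Each lane becomes all ones if value > threshold, else all zeros. */
        __m128i mask = _mm_cmpgt_epi32(values, limit);
        /* Collapse the four lanes into a 4-bit integer, one bit per lane. */
        int bits = _mm_movemask_ps(_mm_castsi128_ps(mask));
        /* Count the set bits, i.e. the number of matching rows. */
        count += (size_t)((bits & 1) + ((bits >> 1) & 1) +
                          ((bits >> 2) & 1) + ((bits >> 3) & 1));
    }
#endif
    /* Handle the leftover rows (or every row when SSE2 is unavailable). */
    for (; i < n; i++) {
        if (col[i] > threshold) {
            count++;
        }
    }
    return count;
}
```

Both functions return the same count for the same input. The difference is that the SIMD loop issues one compare for every four rows instead of one per row. Wider instruction sets such as AVX2 or AVX-512 apply the same pattern to 8 or 16 values at a time, and how much faster it is in practice depends on the hardware and on how quickly memory can feed the CPU.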
This idea of using SIMD has been around for a while. Academic papers such as http://www.cs.columbia.edu/~kar/pubsk/simd.pdf and https://15721.courses.cs.cmu.edu/spring2016/papers/p1493-polychroniou.pdf explored its potential for databases as early as the 2000s, but only recently have database developers started to build it into actual database products.
As of right now, I’m only aware of four databases that use SIMD at the core of their query layer: StarRocks (OLAP), Apache Druid (OLAP), ClickHouse (OLAP) and QuestDB (time-series). All of them are fast. However, among the OLAP databases I mentioned, only StarRocks does performant JOINs at scale. Read more about StarRocks' features at https://docs.starrocks.io/en-us/2.5/introduction/Features
More info at https://github.com/alberttwong/databasecomparison