What is missing in AWS EMR and Databricks. Autoscaling and …

Mar 18, 2025

If you’ve been using AWS EMR for a while, you’ll notice some suboptimal behaviors. Here’s a short list.

  • Autoscaling: It scales by whole VMs and triggers off a coarse cluster metric like CPU utilization (see the sketch after this list). This is very crude and leads to cost waste.
  • Multiplexing: Jobs run in isolation by default. It’s great that jobs don’t interfere with each other, but it also means you can’t extract maximum performance from the cluster.
  • Spark job prioritization: All jobs are treated equally. Great for simplicity, but horrible when certain jobs are more important than others.
  • SIMD-vectorized Parquet reader and writer: The optimization is an obvious win, but neither platform ships one, so the unvectorized path drives up resource utilization.
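To make the autoscaling point concrete, here is a minimal sketch of EMR’s classic autoscaling: a threshold on a single cluster-wide CloudWatch metric adds or removes whole VMs. The cluster id, instance group id, and thresholds below are placeholders, and the same pattern applies if you swap in a CPU-based metric.

```python
# Sketch of an EMR classic autoscaling policy (boto3). The cluster id and
# instance group id are hypothetical; the point is that scaling is driven
# by one coarse metric threshold and adjusts capacity in whole VMs.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXX",          # hypothetical cluster id
    InstanceGroupId="ig-XXXXXXXX",   # hypothetical instance group id
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
        "Rules": [
            {
                "Name": "ScaleOutOnLowFreeMemory",
                "Description": "Add nodes when YARN memory headroom is low",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 2,   # add two whole VMs at a time
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 15.0,
                        "Unit": "PERCENT",
                    }
                },
            },
        ],
    },
)
```

Notice there is nothing workload-aware here: the trigger can’t tell a high-priority pipeline from a throwaway ad-hoc query, and the minimum scaling step is an entire VM.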

All these items are fixable. It just takes engineering effort to build and maintain the solutions.

Here are some ideas that have been debated but haven’t shown up in AWS EMR or Databricks.

  • Autoscaling: Scale based on the workload’s profile rather than a single VM-level metric. That means more efficient scaling up and down, which means lower costs.
  • Multiplexing: Provide a way to multiplex your Apache Spark code with no modification. This can give 10x more performance for Spark, which means jobs complete faster, which reduces costs.
  • Spark job prioritization: Provide an easy way to assign priorities to jobs. (The closest open-source analogue, Spark’s FAIR scheduler pools, is sketched after this list.)
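For contrast, here is roughly what multiplexing and prioritization look like in open-source Spark today: multiple jobs sharing one SparkSession, steered into FAIR scheduler pools. The pool names, allocation file path, and toy workloads are illustrative; the key limitation is that this requires modifying your code, which is exactly what the wish-list item above would eliminate.

```python
# Sketch: concurrent Spark jobs in one application, each assigned to a
# FAIR scheduler pool. Pool names and the allocation file path are
# assumptions; pool weights/minShares live in fairscheduler.xml.
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("multiplex-demo")
    .config("spark.scheduler.mode", "FAIR")
    # Assumed path to an allocations file defining the pools and weights.
    .config("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")
    .getOrCreate()
)
sc = spark.sparkContext

def run_job(pool, n):
    # The pool property is thread-local, so each concurrent job
    # can carry its own priority.
    sc.setLocalProperty("spark.scheduler.pool", pool)
    return spark.range(n).selectExpr("sum(id)").collect()[0][0]

# Submit two jobs at once; the FAIR scheduler shares executors between
# them according to their pools instead of running them back to back.
with ThreadPoolExecutor(max_workers=2) as ex:
    urgent = ex.submit(run_job, "high-priority", 10_000_000)
    batch = ex.submit(run_job, "best-effort", 100_000_000)
    print(urgent.result(), batch.result())
```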

If you want to see a platform that implements these features and goes into detail on what they did, check out Onehouse Compute Runtime (https://www.onehouse.ai/blog/introducing-onehouse-compute-runtime-to-accelerate-lakehouse-workloads-across-all-engines).

