⚡ 2nd Edition — Updated for Spark 4

High Performance Spark

Best Practices for Scaling and Optimizing Apache Spark

O'Reilly Media  ·  2nd Edition  ·  Now covering Spark 4

By Holden Karau, Adi Polak & Rachel Warren

High Performance Spark 2e Book Cover
1.6k+
Stars on spark-testing-base
530+
GitHub Stars on Examples
Spark 4
Fully Updated Coverage
O'Reilly
Publisher

Master Spark Performance at Scale

Apache Spark is the dominant framework for large-scale data processing, but writing fast Spark code requires deep knowledge of its internals, execution model, and configuration. This book gives you that knowledge.

Now fully updated for Apache Spark 4, this second edition covers the latest APIs, performance improvements, and best practices. Whether you're optimizing existing pipelines, designing new ones from scratch, or tuning a production cluster, High Performance Spark provides battle-tested patterns and practical advice grounded in real-world experience.

Pre-order Now →
  • How Spark's execution model, DAG scheduler, and memory management work
  • When to use RDDs, DataFrames, Datasets, and Spark SQL — and why it matters
  • MLlib and machine learning performance patterns
  • Efficient joins, shuffles, and data skew mitigation strategies
  • Serialization, caching, and memory tuning for large workloads
  • Resource management, including accelerators like GPUs, and effective dynamic allocation
  • Testing, monitoring, and debugging Spark applications
  • Integration with modern data lakehouse patterns, with a focus on Iceberg
  • Python (PySpark) and Scala performance trade-offs
  • Structured Streaming best practices for low-latency pipelines (completely new in this edition)

Everything You Need to Tune Spark

Spark Internals

Deep dives into the DAG scheduler, task execution, shuffle mechanics, and memory model to help you reason about performance from first principles.

DataFrames & Datasets

Leverage Catalyst optimizer hints, avoid common pitfalls, and choose the right abstraction for type-safe, high-performance transformations.

Structured Streaming

Build reliable, low-latency streaming pipelines with micro-batch and continuous processing, stateful operations, and exactly-once semantics.

Joins & Shuffles

Master broadcast joins, sort-merge joins, skew handling, and partition strategies to eliminate the most common sources of slowness.

MLlib & ML Pipelines

Build scalable machine learning pipelines, tune hyperparameters efficiently, and deploy models into production Spark workloads.

Testing

The least performant Spark job is one that fails at runtime and must be re-run. You'll learn how to turn your pile of sometimes-working jobs into pipelines reliable enough that you can sleep through the night.

Written by Spark Experts

Holden Karau

Holden is a transgender Canadian software engineer and prolific open-source contributor. She is an Apache Spark PMC member and committer, co-author of Learning Spark (1e), creator of spark-testing-base (1.6k+ stars), and co-founder of Fight Health Insurance, a generative AI tool for appealing health insurance denials. She has worked on big data infrastructure at Google, Apple, Amazon, Netflix, IBM, and other companies.

Adi Polak

Adi is a software engineer and developer advocate at Confluent with deep expertise in distributed systems and Apache Spark. She is an active contributor to the Spark ecosystem and speaks internationally on big data, cloud-native engineering, and developer productivity.

Rachel Warren

Rachel is a PhD student at the Donald Bren School of Information and Computer Sciences at the University of California, Irvine (UCI), where she is advised by Paul Dourish and Melissa Mazmanian. She uses ethnographic and participatory design methods to study predictive systems and data technologies in the public sector.

Prior to her graduate work, Rachel worked as a machine learning engineer and data scientist in industry. She holds a Master's in Information Systems from the University of California, Berkeley School of Information and a BA in Computer Science from Wesleyan University. She was a 2024 summer intern with danah boyd in the Social Media Collective at Microsoft Research.


Learn by Running Real Code

All examples from the book are available as open-source repositories. Clone them, run them, and experiment — the best way to internalize performance concepts is to measure them yourself.

View All Repos on GitHub →

Ready to Level Up Your Spark Skills?

Pre-order High Performance Spark, 2nd Edition — now updated for Spark 4 — the definitive guide to writing fast, production-ready Apache Spark applications, by Holden Karau, Adi Polak & Rachel Warren. Slay the OOM dragon 🐉 in your cluster once and for all (or, you know, at least for several months).