⚡ 2nd Edition — Updated for Spark 4

High Performance Spark

Best Practices for Scaling and Optimizing Apache Spark

O'Reilly Media  ·  2nd Edition  ·  Now covering Spark 4

By Holden Karau, Adi Polak & Rachel Warren

High Performance Spark 2e Book Cover
1.6k+
Stars on spark-testing-base
530+
GitHub Stars on Examples
Spark 4
Fully Updated Coverage
O'Reilly
Publisher

Master Spark Performance at Scale

Apache Spark is the dominant framework for large-scale data processing, but writing fast Spark code requires deep knowledge of its internals, execution model, and configuration. This book gives you that knowledge.

Now fully updated for Apache Spark 4, this second edition covers the latest APIs, performance improvements, and best practices. Whether you're optimizing existing pipelines, designing new ones from scratch, or tuning a production cluster, High Performance Spark provides battle-tested patterns and practical advice grounded in real-world experience.

Pre-order Now →
  • How Spark's execution model, DAG scheduler, and memory management work
  • When to use RDDs, DataFrames, Datasets, and Spark SQL — and why it matters
  • MLlib and machine learning performance patterns
  • Efficient joins, shuffles, and data skew mitigation strategies
  • Serialization, caching, and memory tuning for large workloads
  • Resource management, including accelerators like GPUs, and effective dynamic allocation
  • Testing, monitoring, and debugging Spark applications
  • Integration with modern data lakehouse patterns, with a focus on Iceberg
  • Python (PySpark) and Scala performance trade-offs
  • Structured Streaming best practices for low-latency pipelines (completely new in this edition)

Everything You Need to Tune Spark

Spark Internals

Deep dives into the DAG scheduler, task execution, shuffle mechanics, and memory model to help you reason about performance from first principles.

DataFrames & Datasets

Leverage Catalyst optimizer hints, avoid common pitfalls, and choose the right abstraction for type-safe, high-performance transformations.

Structured Streaming

Build reliable, low-latency streaming pipelines with micro-batch and continuous processing, stateful operations, and exactly-once semantics.

Joins & Shuffles

Master broadcast joins, sort-merge joins, skew handling, and partition strategies to eliminate the most common sources of slowness.

MLlib & ML Pipelines

Build scalable machine learning pipelines, tune hyperparameters efficiently, and deploy models into production Spark workloads.

Testing

The least performant Spark job is one that fails at runtime and must be re-run. You'll learn how to turn your pile of sometimes-working jobs into pipelines reliable enough that you can sleep through the night.

Written by Spark Experts

Holden Karau

Holden is a transgender Canadian software engineer and prolific open-source contributor. She is an Apache Spark PMC member and committer, co-author of Learning Spark (1e), creator of spark-testing-base (1.6k+ stars), and co-founder of Fight Health Insurance, a generative AI tool for appealing health insurance denials. She has worked on big data infrastructure at Google, Apple, Amazon, Netflix, IBM, and other companies.

Adi Polak

Adi is a software engineer and developer advocate at Confluent with deep expertise in distributed systems and Apache Spark. She is an active contributor to the Spark ecosystem and speaks internationally on big data, cloud-native engineering, and developer productivity.

Rachel Warren

Rachel is a PhD student at the Donald Bren School of Information and Computer Sciences at the University of California, Irvine (UCI), where she is advised by Paul Dourish and Melissa Mazmanian. She uses ethnographic and participatory design methods to study predictive systems and data technologies in the public sector.

Prior to her graduate work, Rachel worked as a machine learning engineer and data scientist in industry. She holds a Master's in Information Systems from the University of California, Berkeley School of Information and a BA in Computer Science from Wesleyan University. She was a 2024 summer intern with danah boyd in the Social Media Collective at Microsoft Research.


Learn by Running Real Code

All examples from the book are available as open-source repositories. Clone them, run them, and experiment — the best way to internalize performance concepts is to measure them yourself.

View All Repos on GitHub →

Ready to Level Up Your Spark Skills?

Pre-order High Performance Spark, 2nd Edition — now updated for Spark 4 — the definitive guide to writing fast, production-ready Apache Spark applications, by Holden Karau, Adi Polak & Rachel Warren. Slay the OOM dragon 🐉 in your cluster once and for all (or, you know, at least for several months).