Best Practices for Scaling and Optimizing Apache Spark
O'Reilly Media · 2nd Edition · Now covering Spark 4
Apache Spark is the dominant framework for large-scale data processing, but writing fast Spark code requires deep knowledge of its internals, execution model, and configuration. This book gives you that knowledge.
Now fully updated for Apache Spark 4, this second edition covers the latest APIs, performance improvements, and best practices. Whether you're optimizing existing pipelines, designing new ones from scratch, or tuning a production cluster, High Performance Spark provides battle-tested patterns and practical advice grounded in real-world experience.
Deep dives into the DAG scheduler, task execution, shuffle mechanics, and memory model to help you reason about performance from first principles.
Leverage Catalyst optimizer hints, avoid common pitfalls, and choose the right abstraction for type-safe, high-performance transformations.
Build reliable, low-latency streaming pipelines with micro-batch and continuous processing, stateful operations, and exactly-once semantics.
Master broadcast joins, sort-merge joins, skew handling, and partition strategies to eliminate the most common sources of slowness.
Build scalable machine learning pipelines, tune hyperparameters efficiently, and deploy models into production Spark workloads.
The least performant Spark job is one that fails at runtime and needs to be re-run. You'll learn how to turn your pile of sometimes-working jobs into pipelines reliable enough to let you sleep through the night.
All examples from the book are available as open-source repositories. Clone them, run them, and experiment — the best way to internalize performance concepts is to measure them yourself.
Complete working code examples from the book covering RDDs, DataFrames, Datasets, Spark SQL, Structured Streaming, MLlib, and more. Includes Scala, Python, and Java implementations with benchmark utilities.
Holden's popular testing library for Apache Spark. Provides base classes and utilities for writing reliable, fast Spark unit and integration tests — an essential companion to the testing chapters.
Tooling to automate upgrading to new versions of Spark, along with use of the write-audit-publish (WAP) pattern to validate correctness. Implemented in Python, Scala, and SQL.
Miscellaneous utilities for those fiddly bits not in Spark core.
Uses Spark to build the training data for fine-tuning Fight Health Insurance, a tool for appealing insurance denials.
A proof-of-concept auto-tuner for Apache Spark that automatically discovers good configuration parameters for your specific workload — a practical companion to the cluster tuning chapters.
Need to share the excitement? We've got you covered:
Pre-order High Performance Spark, 2nd Edition — now updated for Spark 4 — the definitive guide to writing fast, production-ready Apache Spark applications, by Holden Karau, Adi Polak & Rachel Warren. Slay the OOM dragon 🐉 in your cluster once and for all (or, you know, at least for several months).