Spark in Action, Second Edition
- Paperback: 576 pages
- Publisher: Manning Publications; 2nd edition (May 17, 2020)
- Language: English
- ISBN-10: 1617295523
- ISBN-13: 978-1617295522
Spark in Action, 2nd Edition: Covers Apache Spark 3 with Examples in Java, Python, and Scala
The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, 2nd Edition, you’ll learn to take advantage of Spark’s core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark’s powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop.
Unlike many Spark books written for data scientists, Spark in Action, Second Edition is designed for data engineers and software engineers who want to master data processing using Spark without having to learn a complex new ecosystem of languages and tools. You’ll instead learn to apply your existing Java and SQL skills to take on practical, real-world challenges.
Spark is a powerful general-purpose analytics engine that can handle massive amounts of data distributed across clusters with thousands of servers. Optimized to run in memory, this impressive framework can process data up to 100x faster than most Hadoop-based systems. Spark’s support for SQL, along with its ability to rapidly run repeated queries and quickly adapt to modified queries, makes it well suited for machine learning, so important in this age of big data. Whether you’re using Java, Scala, or Python, Spark offers straightforward APIs to access its core features.
Spark in Action, 2nd Edition is an entirely new book that teaches you everything you need to create end-to-end analytics pipelines in Spark. Rewritten from the ground up with lots of helpful graphics, it covers the roles of DAGs and dataframes, the advantages of “lazy evaluation”, and ingestion from files, databases, and streams.
By working through carefully designed Java-based examples, you’ll delve into Spark SQL, interface with Python, and cache and checkpoint your data. Along the way, you’ll learn to interact with common enterprise data technologies like HDFS and file formats like Parquet, ORC, and Avro.
- Lots of examples based on the Spark Java APIs using real-life datasets and scenarios
- Examples based on Apache Spark 3.0.0
- Ingestion through files, databases, and streaming
- Building a custom ingestion process
- Querying distributed datasets with Spark SQL
- Deploying Apache Spark applications
- Caching and checkpointing your data
- Interfacing with data scientists using Python
- Applied machine learning
- Spark use cases including Lumeris, CERN, and IBM
You’ll also discover interesting Spark use cases, like interactive reporting, machine learning pipelines, and even monitoring players in online games. You’ll even get a quick look at machine learning techniques you can apply without a PhD in mathematics! All examples are available on GitHub for you to explore and adapt as you learn. Demand for Spark-savvy developers is so high that they’re among the highest paid in the industry today!