J**A
Well organized and solid information
It was easy to follow the book, and the setup of the Spark shell was clearly written. I also found the online instructions for installing Spark locally to be sufficient. The book is well organized, delineating the different components of Spark: intro, Structured APIs, streaming, optimizations, data lakes, and ML deployment options. While ML deployment needs for individual business use cases are highly specific, I found the overview deployment framework provided by the book helpful. I also liked that the book uses screenshots of the Spark UI, with arrows pointing into the screenshots to explain it, since the UI can be hard to understand. The code samples and the graphics in other sections are useful as well. There’s also coverage of how to connect to different apps, like Beeline (which I’d never heard of), Tableau, and Thrift. Overall, the book contains solid information on the inner workings of Spark. I would recommend giving this book a read!
A**Z
Covers theoretical and practical aspects of the Spark ecosystem in great depth
This book is a great resource for learning about Spark. It covers in detail the concepts related to the Spark architecture, theoretical concepts about parallelization, and topics related to optimizing analytical pipelines running on Spark. The book has a very nice section about Delta Lake. It also covers MLflow at a good level of detail, more like a complement to the docs. The section on machine learning includes theoretical explanations of how some ML algorithms change when run in parallel, as MLlib does. I used the book as an extra study resource when taking some Databricks certifications, and it was a great addition to my study materials.
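For anyone curious, the MLflow tracking workflow the book complements boils down to something like the following minimal Python sketch (the parameter and metric names here are illustrative, not taken from the book):

    import mlflow

    # Record one training run; MLflow stores the parameters and metrics
    # under the active experiment so runs can be compared later.
    with mlflow.start_run():
        mlflow.log_param("maxDepth", 5)    # hypothetical hyperparameter
        mlflow.log_metric("rmse", 0.78)    # hypothetical evaluation metric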
S**E
Decent introduction to Spark
I am always trying to learn new skills to make myself more marketable in the workplace. My background is mainly in SQL with some Python, and I am learning JS right now. I decided to give this book a shot to see whether Spark is another tool I want to add to my arsenal. The book does what it promises: it gives you a good introduction to Spark. I did have some issues installing the required programs on a MacBook, but once I had everything installed, I was able to follow along. My big complaint is what others have mentioned: concepts are introduced without any background on what they are or why they matter. If you have some programming background, this book should be sufficient to get you up and running in Spark.
C**S
Good book for getting started with Spark
It gives good examples in both Scala and Python, although they aren’t always in Python; the Scala language is similar (like a Java-Python hybrid). It suggests that if you want to practice, you can use Databricks if you don’t want to install anything on-premises, or, if you prefer, install Spark using Windows WSL or a Linux virtual machine.
M**D
Must read
This book is a must-read for anyone trying to learn Spark in a big data environment.
A**R
More Databricks-centric
Nice book if you really want to work hands-on without having to worry about the internals of Spark.
E**G
Best introductory Spark guide as of early 2021
The foreword and preface to this book note that an update to the first edition, published in 2015, was long overdue. After all, the first edition uses Apache Spark 1.3.0, whereas this update uses Apache Spark 3.0.0-preview2 (the latest version available at the time of writing). For the most part, I successfully ran all notebook code out of the box using Databricks Runtime 7.6 ML (which includes Apache Spark 3.0.1 and Scala 2.12); the minor issues I hit are explained later in this review, alongside my resolutions. I was, however, able to successfully run all standalone PySpark applications from chapters #2 and #3 out of the box using Apache Spark 3.0.1 and Python 3.7.9. As explained, the approach used here is intended to be conducive to hands-on learning, with a focus on Spark's Structured APIs, so there are a few topics that aren't covered: the older low-level Resilient Distributed Dataset (RDD) APIs, GraphX (Spark's API for graphs and graph-parallel computation), how to extend Spark's Catalyst optimizer, how to implement your own catalog, and how to write your own DataSource V2 data sinks and sources.

Content is broken down into 12 chapters: (1) "Introduction to Apache Spark: A Unified Analytics Engine", (2) "Downloading Apache Spark and Getting Started", (3) "Apache Spark's Structured APIs", (4) "Spark SQL and DataFrames: Introduction to Built-in Data Sources", (5) "Spark SQL and DataFrames: Interacting with External Data Sources", (6) "Spark SQL and Datasets", (7) "Optimizing and Tuning Spark Applications", (8) "Structured Streaming", (9) "Building Reliable Data Lakes with Apache Spark", (10) "Machine Learning with MLlib", (11) "Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark", and (12) "Epilogue: Apache Spark 3.0". The longest chapter is chapter #8, followed closely by chapters #3, #4, #5, and #10, and the most notebooks are provided for chapters #10 and #11, although this is largely due to individual notebooks dedicated to a variety of topics.

This book is the fourth of four related books I've worked through, a couple of years after the earlier three: "Spark: The Definitive Guide" (2018), "High Performance Spark: Best Practices for Scaling and Optimizing Apache Spark" (2017), and "Practical Hive: A Guide to Hadoop's Data Warehouse System" (2016). As I mentioned in an earlier review, if you are new to Apache Spark, these four texts will help point you in the right direction, although keep in mind that the related tech stack is evolving, and you will need to supplement this material with web documentation and developer forums, as well as get hands-on with the tooling. Reading the earlier three books in reverse order of publication date exposed me to more current material sooner rather than later, but this was largely a coincidence. Now that this new book is available, I recommend working through it first. While I wouldn't discount "Spark: The Definitive Guide", because it provides content not in this new book and I personally think it flows better, use it very judiciously because it was created using the Spark 2.0.1 APIs.

The only notebooks I wasn't able to run out of the box are in chapter #11. In notebooks 11-3 ("Distributed Inference"), 11-5 ("Joblib"), and 11-7 ("Koalas"), FileNotFoundErrors were generated when attempting to use Pandas to read from CSV or Parquet files using "read_csv()" and "read_parquet()", respectively.
In taking a look at what the community had to say, I discovered that this is a known issue, so I replaced these Pandas statements with "spark.read.option(...).csv("...")" and "spark.read.option(...).parquet("...")", respectively, subsequently converting to Pandas using "toPandas()". According to the documentation, Pandas 1.0.1 is installed on both the CPU and GPU clusters for the aforementioned Databricks Runtime (the latest non-beta currently available). Also in notebook 11-3 ("Distributed Inference"), the following PythonException was generated when attempting to execute a "mapInPandas()" statement that uses a mix of numeric data types in the schema argument: "pyarrow.lib.ArrowInvalid: Could not convert 3.0 with type str: tried to convert to double". In the absence of decent community guidance, and because this statement is solely used for display purposes, I simply converted all of these data types to "STRING". According to the documentation, PyArrow 1.0.1 is installed on both the CPU and GPU clusters for the aforementioned Databricks Runtime.

I personally got the most value out of chapters #7 and #8. Chapter #7 covers optimizing and tuning Spark for efficiency, caching and persistence of data, Spark joins, and inspecting the Spark UI. Chapter #8 covers the evolution of the Apache Spark stream processing engine, the programming model of Structured Streaming, the fundamentals of a Structured Streaming query, streaming data sources and sinks, data transformations, stateful streaming aggregations, streaming joins, arbitrary stateful computations, and performance tuning. In particular, I especially appreciated the sections on the two most common Spark join strategies (the broadcast hash join and the shuffle sort merge join), the Spark UI, stateful streaming aggregations, and streaming joins. Highly recommended for anyone making use of Spark.
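For reference, the two workarounds I described above boil down to something like the following sketch (the file paths, DataFrame name, and column names are placeholders, not the book's actual code; "spark" is the session already available in a Databricks notebook):

    # Workaround for the FileNotFoundErrors: read with Spark instead of
    # Pandas, then convert the resulting DataFrame with toPandas().
    pdf = (spark.read
                .option("header", True)
                .csv("dbfs:/placeholder/path/data.csv")  # or .parquet(...)
                .toPandas())

    # Workaround for the pyarrow ArrowInvalid error: declare every field in
    # the mapInPandas() schema as STRING, since the output is only displayed.
    def to_strings(iterator):
        for batch in iterator:          # each batch is a Pandas DataFrame
            yield batch.astype(str)

    display_df = df.mapInPandas(to_strings, schema="col1 STRING, col2 STRING")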