Data Science with Python and Spark

Process big data and apply machine learning at scale with Apache Spark and Python.


Apache Spark is an open-source distributed engine for querying and processing data. In this two-day hands-on workshop, you will learn how to leverage Spark from Python to process large amounts of data.

After a presentation of the Spark architecture, we'll begin by manipulating Resilient Distributed Datasets (RDDs) and work our way up to Spark DataFrames. We'll discuss the concept of lazy evaluation in detail and demonstrate various transformations and actions specific to RDDs and DataFrames. You'll also learn how DataFrames can be manipulated using SQL queries.

We’ll show you how to apply supervised machine learning models such as linear regression, logistic regression, decision trees, and random forests, as well as unsupervised techniques such as PCA and k-means clustering.

By the end of this workshop, you will have a solid understanding of how to process data using PySpark and you will understand how to use Spark’s machine learning library to build and train various machine learning models.

About your instructor

Jeroen Janssens
Principal Instructor, Data Science Workshops

Jeroen is an RStudio Certified Instructor who enjoys visualizing data, building machine learning models, and automating things using Python, R, or Bash. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and various startups in New York City. He is the author of Data Science at the Command Line. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.

What you’ll learn

  • Understand Apache Spark, its architecture, and its components
  • Work with RDDs and lazy evaluation
  • Build and interact with Spark DataFrames using Spark SQL
  • Use Spark SQL and DataFrames to process data using traditional SQL queries
  • Apply a spectrum of supervised and unsupervised machine learning algorithms
  • Handle feature engineering, class imbalance, bias and variance, and cross-validation to build a well-fitted model

This workshop is for you because

  • You work with data regularly and want to be able to scale up the quantity of data processed
  • You want to understand the methods specific to Spark for wrangling data
  • You want to learn how to apply machine learning algorithms to large amounts of data


Day 1:

  • Introduction to Apache Spark
    • Setting up Spark
    • Spark fundamentals
    • Spark architecture
  • Resilient Distributed Datasets (RDDs)
    • Getting data into Spark
    • Actions
    • Transformations
  • Spark DataFrames
    • Speeding up Spark with DataFrames
    • Creating DataFrames
    • Interoperating with RDDs
    • Working with the DataFrame API
    • Applying SQL to Spark DataFrames

Day 2:

  • ML and MLlib packages
    • API Overview
    • Transformers
    • Estimators
    • Pipelines
  • Applying Machine Learning
    • Model selection
    • Cross validation
    • Tuning
    • Classification
    • Regression
    • Recommender system
  • Where to go from here


We’ve previously delivered this workshop at:

KPN ICT Consulting


Participants are expected to be familiar with the following Python syntax and concepts:

  • assignment, arithmetic, boolean expression, tuple unpacking
  • bool, int, float, list, tuple, dict, str, type casting
  • in operator, indexing, slicing
  • if, elif, else, for, while
  • range(), len(), zip()
  • def, (keyword) arguments, default values
  • import, import as, from ... import
  • lambda functions, list comprehension
  • JupyterLab or Jupyter Notebook

Some experience with Pandas and SQL is useful, but not required.

Participants are kindly requested to install the required software prior to the start of the workshop. Detailed installation instructions will be provided by email after signup.

Photos and testimonials

Laurens Koppenol
Lead Data Scientist, ProRail

Our DataLab team enjoyed a three-day PySpark course from Jeroen. Jeroen’s approach is personal and professional. I recommend Data Science Workshops to anyone in the field of data science.

Sign up

This hands-on workshop hasn’t been scheduled yet, but we’d happily organise one for your team.