Data Analysis with Python and Pandas

Import, manipulate, explore, and visualise data with Pandas, NumPy, and JupyterLab.

Introduction

Learn how to accelerate your data analyses using Pandas, a Python library specifically designed for working with medium-sized data sets. Together with JupyterLab it provides a convenient environment for interactive data analysis.

Pandas is part of the so-called PyData ecosystem. In this workshop we’ll start with an overview of PyData and explain where Pandas fits in and how it interacts with other libraries such as NumPy and Seaborn. Pandas introduces a few new data structures, most importantly the DataFrame, and understanding them is essential for working with tabular data efficiently.

Pandas offers many features. In one day, through a balance of presentation and interactive exercises, we’ll cover the most important ones: importing, filtering, grouping, joining, exploring, and visualising data. By the end of this workshop, you’ll understand the fundamentals of Pandas, be aware of common pitfalls, and be ready to perform your own analyses.

What you’ll learn

  • Load data from text files, spreadsheets, databases, and APIs
  • Use the Split-Apply-Combine paradigm to summarise data
  • Perform advanced joins and merges
  • Generate insightful pivot tables
  • Transform data between wide and long formats
  • Work with time series data
  • Explore data using a variety of visualisation types
  • Avoid common pitfalls in NumPy and Pandas by understanding general concepts and principles
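
As a rough illustration of the Split-Apply-Combine paradigm mentioned above, here is a minimal sketch. The DataFrame and its column names (region, units) are made up for this example and are not part of the workshop material.

```python
import pandas as pd

# Hypothetical sales records; the column names are invented for illustration.
sales = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south", "south"],
        "units": [10, 20, 5, 15, 10],
    }
)

# Split the rows by region, apply a sum and a mean to each group,
# and combine the results into one summary table.
summary = sales.groupby("region")["units"].agg(["sum", "mean"])
print(summary)
```

The result has one row per region and one column per aggregate, which is the "combine" step of the paradigm.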

This workshop is for you because

  • You have experience in Excel or R and want to learn about Pandas and the PyData ecosystem
  • You have programming experience in Python and want to start analysing data using Pandas
  • You want to improve your understanding of Pandas by learning general concepts and principles common to most data manipulation frameworks

Schedule

  • Overview of the PyData ecosystem
    • NumPy, SciPy, Pandas
    • Matplotlib, Seaborn, Bokeh
    • scikit-learn
  • Essential data structures
    • NumPy data types
    • NumPy arrays
    • Pandas Series
    • Pandas DataFrames
    • Pandas Index, MultiIndex
  • Importing data
    • From CSV
    • From Excel
    • From databases
    • From APIs
  • Manipulating data
    • Selecting rows and columns
    • Filtering rows
    • Joining and concatenating
    • Missing values, duplicates
    • Converting data types
    • Dates and times
    • Working with categorical data
    • String manipulation
  • Exploring data
    • Computing aggregate statistics
    • Pivot tables
    • Correlations
  • Visualising data
    • Histogram
    • Density plot
    • Boxplot
    • Bar chart
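
To give a feel for how the importing, manipulating, and exploring steps in the schedule fit together, here is a small sketch. The CSV text, city names, and temperatures are invented for illustration, and the in-memory string merely stands in for a real file.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = """date,city,temperature
2024-01-01,Amsterdam,4.5
2024-01-01,Utrecht,
2024-01-02,Amsterdam,6.0
2024-01-02,Utrecht,5.0
"""

# Importing: read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Manipulating: drop the rows where the temperature is missing.
clean = df.dropna(subset=["temperature"])

# Exploring: mean temperature per date and city as a pivot table.
pivot = clean.pivot_table(
    index="date", columns="city", values="temperature", aggfunc="mean"
)
print(pivot)
```

The empty temperature field is read as a missing value, so the corresponding cell in the pivot table stays empty (NaN) rather than being filled with a guess.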

Prerequisites

You’re expected to have some experience with programming in Python. Our workshop Introduction to Programming in Python is one option that can help you with that. Roughly speaking, if you’re familiar with the following Python syntax and concepts, then you’ll be fine:

  • assignment, arithmetic, boolean expressions, tuple unpacking
  • bool, int, float, list, tuple, dict, str, type casting
  • in operator, indexing, slicing
  • if, elif, else, for, while
  • range(), len(), zip()
  • def, (keyword) arguments, default values
  • import, import ... as, from ... import
  • lambda functions, list comprehension
  • JupyterLab or Jupyter Notebook
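
As an informal self-check, the snippet below touches most of the syntax and concepts listed above. The names and values are arbitrary and chosen only for this illustration. If you can read it without looking anything up, you likely meet the prerequisites.

```python
# Assignment and tuple unpacking
name, scores = "Ada", [7, 9, 8]


# def with a keyword argument that has a default value
def average(values, ndigits=1):
    return round(sum(values) / len(values), ndigits)


# Indexing, slicing, and the in operator
first_two = scores[:2]
has_nine = 9 in scores

# for loop with zip(), range(), and len(), plus if / elif / else
labels = []
for position, score in zip(range(len(scores)), scores):
    if score >= 9:
        labels.append((position, "high"))
    elif score >= 8:
        labels.append((position, "medium"))
    else:
        labels.append((position, "low"))

# List comprehension, lambda function, dict, and type casting
squares = [s ** 2 for s in scores]
ranked = sorted(scores, key=lambda s: -s)
report = {"name": name, "mean": average(scores), "count": str(len(scores))}
print(report)
```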

We’re going to use Python together with JupyterLab and the following packages:

  • numpy
  • pandas
  • seaborn

The recommended way to get everything set up is to download and install the Anaconda Distribution.

Alternatively, if you don’t want to use Anaconda, you can install everything using pip. Either way, if running the statement import pandas, seaborn in Python doesn’t produce any errors, you’ve set everything up correctly.
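
If you prefer a check that reports every missing package at once instead of stopping at the first import error, something along these lines could work. It is only a sketch: it uses the standard library's importlib to look for the three packages named above, and confirms that each one can be found rather than that it imports cleanly.

```python
import importlib.util

# The three packages named in this workshop description.
required = ["numpy", "pandas", "seaborn"]

# find_spec returns None when a top-level package cannot be found,
# and it does not import the package itself.
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, installed in status.items():
    print(f"{name}: {'ok' if installed else 'missing'}")
```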

About your instructor

Jeroen Janssens
Principal Instructor, Data Science Workshops

Jeroen is an RStudio Certified Instructor who enjoys visualizing data, building machine learning models, and automating things using either Python, R, or Bash. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and various startups in New York City. He is the author of Data Science at the Command Line. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University.

Clients

We’ve previously delivered this workshop at:

Vocalink
Jheronimus Academy of Data Science
Brabant Water
Textkernel
Transavia

Photos and testimonials

Karlijn Dinnissen
Data Quality Analyst, Textkernel

Attending the bespoke course Data Munging with Pandas at Textkernel has proven to be an excellent choice. Jeroen’s personal approach and highly interactive way of teaching made this course valuable to a diverse group of developers and analysts, as did the possibility to apply theory on our own data and API during the courses. I’ve since been able to code cleaner and more efficient, and applied the pandas package in several monitoring and analytics scripts.

Stijn de Jong
Advisor Water Supply, Brabant Water

At Brabant Water, most of us were still using spreadsheets to clean, analyse, and model our data. Thanks to Data Science Workshops, who delivered an engaging, hands-on workshop at our office, many of us have switched to Python and Jupyter Notebook, which allows our analyses to be much more advanced and reliable.
