
Scrape HTML Elements Across Paginated Content in R and rvest

Jeroen Janssens
Nov 5, 2021 • 5 min read

There’s something exciting about scraping a website to build your own dataset! For R, there’s the rvest package to harvest (i.e., scrape) static HTML.

When the HTML elements you’re interested in are spread across multiple pages and you know the URLs of the pages up front (or you know how many pages you need to visit and the URLs are predictable), you can most likely use a for loop or one of the map functions from the purrr package. For example, to get the titles of the Stack Overflow questions tagged R from the first three pages, you could do:

library(purrr)
library(rvest)

(urls <- stringr::str_c("https://stackoverflow.com/questions/",
                        "tagged/r?tab=votes&page=", 1:3))
## [1] "https://stackoverflow.com/questions/tagged/r?tab=votes&page=1"
## [2] "https://stackoverflow.com/questions/tagged/r?tab=votes&page=2"
## [3] "https://stackoverflow.com/questions/tagged/r?tab=votes&page=3"

map(urls,
    ~ read_html(.) %>%
      html_elements("h3 > a.question-hyperlink") %>%
      html_text()) %>%
  flatten_chr() %>%
  head(n = 10)
## [1] "How to make a great R reproducible example"
## [2] "How to sort a dataframe by multiple column(s)"
## [3] "How to join (merge) data frames (inner, outer, left, right)"
## [4] "Grouping functions (tapply, by, aggregate) and the *apply family"
## [5] "Remove rows with all or some NAs (missing values) in data.frame"
## [6] "Drop data frame columns by name"
## [7] "How do I replace NA values with zeros in an R dataframe?"
## [8] "What are the differences between \"=\" and \"<-\" assignment operators in R?"
## [9] "data.table vs dplyr: can one do something well the other can't or does poorly?"
## [10] "Rotating and spacing axis labels in ggplot2"

However, if you don’t know up front how many pages you need to visit, or the URLs aren’t easily generated, but each page contains a link to the next one, something like this function has served (or scraped) me well:

html_more_elements <- function(session, css, more_css) {
  xml2:::xml_nodeset(c(
    # Collect the matching elements on the current page
    html_elements(session, css),
    # Follow the "next" link and recurse; if that fails (no link left
    # or the server errors), return NULL so the recursion bottoms out
    tryCatch({
      html_more_elements(session_follow_link(session, css = more_css),
                         css, more_css)
    }, error = function(e) NULL)
  ))
}

This R function uses several functions from the rvest package and recursion to select HTML elements across multiple pages[1]. It has three arguments:

  1. A session object created by rvest::session()
  2. A CSS selector that identifies the elements you want to select from each page
  3. A CSS selector that identifies the link to the next page

Note that this function stops only when there are no more links to follow or when the server responds with an error.
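
If you’d rather avoid recursion, the same stopping logic can be written as a loop. Here’s a minimal sketch with the same behavior; the name html_more_elements_iter is my own, not part of rvest:

html_more_elements_iter <- function(session, css, more_css) {
  elements <- list()
  repeat {
    # Collect the matching elements on the current page
    elements <- c(elements, html_elements(session, css))
    # Try to follow the "next" link; NULL means no link left or a server error
    session <- tryCatch(session_follow_link(session, css = more_css),
                        error = function(e) NULL)
    if (is.null(session)) break
  }
  xml2:::xml_nodeset(elements)
}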

Here’s an example that scrapes the names of all Lego Star Wars sets:

lego_sets <-
  session("https://www.lego.com/en-us/themes/star-wars") %>%
  html_more_elements("li h2 > span", "a[rel=next]") %>%
  html_text()
## Navigating to /en-us/themes/star-wars?page=2
## Navigating to /en-us/themes/star-wars?page=3
## Navigating to /en-us/themes/star-wars?page=4
## Navigating to /en-us/themes/star-wars?page=5

length(lego_sets)
## [1] 89

head(lego_sets, n = 10)
## [1] "Republic Gunship™" "Imperial Star Destroyer™"
## [3] "Mos Eisley Cantina™" "R2-D2™"
## [5] "Imperial Light Cruiser™" "The Bad Batch™ Attack Shuttle"
## [7] "Darth Vader™ Helmet" "Darth Vader™ Meditation Chamber"
## [9] "Imperial Shuttle™" "Imperial Probe Droid™"

Here’s another example that selects all the titles from Hacker News and shows the first 10:

session("https://news.ycombinator.com") %>%
html_more_elements(".titlelink", ".morelink") %>%
html_text() %>%
head(n = 10)
## Navigating to news?p=2
## Navigating to news?p=3
## Navigating to news?p=4
## Navigating to news?p=5
## Warning in session_set_response(x, resp): Service Unavailable (HTTP 503).
## [1] "ADSL works over wet string (2017)"
## [2] "How to Use Dig"
## [3] "LLVM internals, part 4: attributes and attribute groups"
## [4] "The Field of Longevity Biotech Is a Mess"
## [5] "Three Ways to Debug Code in Elixir"
## [6] "Is the big tech era ending?"
## [7] "FirefoxPWA: Progressive Web Apps for Firefox"
## [8] "Show HN: I wrote a book about using Lambda with Go"
## [9] "A no-reload HTML/CSS/JS playground with instant editor ↔ output Sync℠"
## [10] "Bosch gives go-ahead for volume production of silicon carbide chips"

Note that I’m getting a 503 after a couple of pages. That’s probably because I’m making too many requests in too little time. Adding some delay to the function (with, e.g., Sys.sleep(1)) would likely solve this, as sketched below. Remember, always Scrape Responsibly™.
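
For instance, a more patient variant of the function could pause before following each link. This is just a sketch; the name html_more_elements_politely and the one-second default delay are my own choices:

html_more_elements_politely <- function(session, css, more_css, delay = 1) {
  xml2:::xml_nodeset(c(
    html_elements(session, css),
    tryCatch({
      # Pause before requesting the next page to avoid hammering the server
      Sys.sleep(delay)
      html_more_elements_politely(session_follow_link(session, css = more_css),
                                  css, more_css, delay)
    }, error = function(e) NULL)
  ))
}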

— Jeroen


  1. There’s a discussion on GitHub about whether it makes sense to add this functionality to rvest.

