
Scrape HTML Elements Across Paginated Content in R and rvest

Jeroen Janssens
Nov 5, 2021 • 5 min read

There’s something exciting about scraping a website to build your own dataset! For R, there’s the rvest package to harvest (i.e., scrape) static HTML.

When the HTML elements you’re interested in are spread across multiple pages and you know the URLs of the pages up front (or you know how many pages you need to visit and the URLs are predictable), you can most likely use a for loop or one of the map functions from the purrr package. For example, to get the titles of the Stack Overflow questions tagged R from the first three pages, you could do:

library(purrr)
library(rvest)

(urls <- stringr::str_c("https://stackoverflow.com/questions/",
                        "tagged/r?tab=votes&page=", 1:3))
## [1] "https://stackoverflow.com/questions/tagged/r?tab=votes&page=1"
## [2] "https://stackoverflow.com/questions/tagged/r?tab=votes&page=2"
## [3] "https://stackoverflow.com/questions/tagged/r?tab=votes&page=3"

map(urls,
    ~ read_html(.) %>%
      html_elements("h3 > a.question-hyperlink") %>%
      html_text()) %>%
  flatten_chr() %>%
  head(n = 10)
## [1] "How to make a great R reproducible example"
## [2] "How to sort a dataframe by multiple column(s)"
## [3] "How to join (merge) data frames (inner, outer, left, right)"
## [4] "Grouping functions (tapply, by, aggregate) and the *apply family"
## [5] "Remove rows with all or some NAs (missing values) in data.frame"
## [6] "Drop data frame columns by name"
## [7] "How do I replace NA values with zeros in an R dataframe?"
## [8] "What are the differences between \"=\" and \"<-\" assignment operators in R?"
## [9] "data.table vs dplyr: can one do something well the other can't or does poorly?"
## [10] "Rotating and spacing axis labels in ggplot2"

However, if you don’t know up front how many pages you need to visit, or the URLs aren’t easily generated, but each page contains a link to the next one, something like this function has served (or scraped) me well:

html_more_elements <- function(session, css, more_css) {
  xml2:::xml_nodeset(c(
    # Collect the matching elements on the current page
    html_elements(session, css),
    # Follow the "next" link and recurse; if that fails (no link left
    # or the server errors), return NULL so the recursion bottoms out
    tryCatch({
      html_more_elements(session_follow_link(session, css = more_css),
                         css, more_css)
    }, error = function(e) NULL)
  ))
}

This R function uses several functions from the rvest package and recursion to select HTML elements across multiple pages[1]. It has three arguments:

  1. A session object created by rvest::session()
  2. A CSS selector that identifies the elements you want to select from each page
  3. A CSS selector that identifies the link to the next page

Note that this function stops only when there are no more links to follow or when the server responds with an error.
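
If you’d rather avoid recursion, the same stopping logic can be written as a loop. Here’s a minimal sketch with the same behavior; the name html_more_elements_iter is my own, not part of rvest:

html_more_elements_iter <- function(session, css, more_css) {
  elements <- list()
  repeat {
    # Collect the matching elements on the current page
    elements <- c(elements, html_elements(session, css))
    # Try to follow the "next" link; NULL means no link left or a server error
    session <- tryCatch(session_follow_link(session, css = more_css),
                        error = function(e) NULL)
    if (is.null(session)) break
  }
  xml2:::xml_nodeset(elements)
}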

Here’s an example that scrapes the names of all Lego Star Wars sets:

lego_sets <-
  session("https://www.lego.com/en-us/themes/star-wars") %>%
  html_more_elements("li h2 > span", "a[rel=next]") %>%
  html_text()
## Navigating to /en-us/themes/star-wars?page=2
## Navigating to /en-us/themes/star-wars?page=3
## Navigating to /en-us/themes/star-wars?page=4
## Navigating to /en-us/themes/star-wars?page=5

length(lego_sets)
## [1] 89

head(lego_sets, n = 10)
## [1] "Republic Gunship™" "Imperial Star Destroyer™"
## [3] "Mos Eisley Cantina™" "R2-D2™"
## [5] "Imperial Light Cruiser™" "The Bad Batch™ Attack Shuttle"
## [7] "Darth Vader™ Helmet" "Darth Vader™ Meditation Chamber"
## [9] "Imperial Shuttle™" "Imperial Probe Droid™"

Here’s another example that selects all the titles from Hacker News and shows the first 10:

session("https://news.ycombinator.com") %>%
html_more_elements(".titlelink", ".morelink") %>%
html_text() %>%
head(n = 10)
## Navigating to news?p=2
## Navigating to news?p=3
## Navigating to news?p=4
## Navigating to news?p=5
## Warning in session_set_response(x, resp): Service Unavailable (HTTP 503).
## [1] "ADSL works over wet string (2017)"
## [2] "How to Use Dig"
## [3] "LLVM internals, part 4: attributes and attribute groups"
## [4] "The Field of Longevity Biotech Is a Mess"
## [5] "Three Ways to Debug Code in Elixir"
## [6] "Is the big tech era ending?"
## [7] "FirefoxPWA: Progressive Web Apps for Firefox"
## [8] "Show HN: I wrote a book about using Lambda with Go"
## [9] "A no-reload HTML/CSS/JS playground with instant editor ↔ output Sync℠"
## [10] "Bosch gives go-ahead for volume production of silicon carbide chips"

Note that I’m getting a 503 after a couple of pages. That’s probably because I’m making too many requests in too little time. Adding some delay to the function (with, e.g., Sys.sleep(1)) would likely solve this, as sketched below. Remember, always Scrape Responsibly™.
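
For instance, a more patient variant of the function could pause before following each link. This is just a sketch; the name html_more_elements_politely and the one-second default delay are my own choices:

html_more_elements_politely <- function(session, css, more_css, delay = 1) {
  xml2:::xml_nodeset(c(
    html_elements(session, css),
    tryCatch({
      # Pause before requesting the next page to avoid hammering the server
      Sys.sleep(delay)
      html_more_elements_politely(session_follow_link(session, css = more_css),
                                  css, more_css, delay)
    }, error = function(e) NULL)
  ))
}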

— Jeroen


  1. There’s a discussion on GitHub about whether it makes sense to add this functionality to rvest.

