Photo by Markus Spiske on Unsplash

How to web-scrape data from multiple pages

My outcome from an online course: web-scrape the Destatis data release calendar!

Through my involvement in Correlaid Rhein-Main I received a license to Dataquest so I took the opportunity to learn some new tricks! While I skipped most of the beginner classes of the “Data Analyst in R” path, I thought the two courses on APIs and web-scraping might be useful: so far my use of APIs at work mostly was done via packages (or, as in the Bee Observer project, prepared by somebody else), so I was interested to know what’s behind these GET and POST commands. While I still had some good learning experiences, I still had the impression that the course had some teething problems and many things were not well explained or described. I will still try to share my comments with the course developers so that hopefully they will improve it for future users - I had done so already via the community page, but apparently this was not the right place… 😒
Anyhow… the web-scraping course was luckily much better in my opinion! 😃
It teaches well how to use the rvest package and search for the right html selectors using the inspect tool of your web browser and the selector gadget for Google Chrome. After that I wanted to immediately jump in and find use-cases at work to implement my new knowledge. One task that came to mind is rather dull and administrative - and thus even more so a good candidate for automation. Within the process of the macroeconomic projection exercises the special team coordinating the exercises circulates well in advance of the next projection round a propose calendar outlining the timeline of forecast iterations and high-level meetings. Building on top of this general FTF (Forecast Task Force) calendar, my division, responsible for the projections for the largest five euro area countries, creates another one with more intermediate steps that are relevant for the country experts producing the forecast hands-on. As a guide for when might be the right time to work on the forecast we also enter in this calendar a column with the release dates of important macroeconomic indicators. For a long time this was a rather manual job, which could cost you a few hours to get all the data ready, oftentimes in an otherwise busy period. The statistical offices Eurostat, Istat and INE all offer internet calendars, which can be imported to Outlook, but also be read using the calendar package. The national statistical offices of the Netherlands (CBS) and Germany (Destatis), howevever, only offer their release dates as lists or tables on the website. The annual calendar of Destatis is in particular tricky, because the list is split over multiple pages.

page content split over multiple pages accessible via navigation buttons

But let’s get to that part later. First let’s just find the selector for the data releases. Using the inspect tool of the browser we see that the items we are interested in are contained within the a div of class row within a div of class c-result.

css selector for the required data

Now let’s start getting our hands dirty in R and load the rvest and tidyverse packages.

library(rvest)
library(tidyverse)

Let’s see what the output of the first page looks like. For that we use read_html on the URL of the first page, select the nodes as identified above using html_nodes and convert to text using html_text.

destatis_calendar_first <- read_html(
  "https://www.destatis.de/SiteGlobals/Forms/Suche/Termine/EN/Terminsuche_Formular.html"
) %>% 
  html_nodes(".c-result .row") %>% 
  html_text()

destatis_calendar_first
##  [1] "\n    \n      Persons employed in manufacturing\n      Reporting period: December 2020\n        Date of issue: 2021.02.16  (deadline)\n      \n  "                         
##  [2] "\n    \n      Indices of wholesale trade prices\n      Reporting period: January 2021\n        Date of issue: 2021.02.17  (deadline)\n      \n  "                          
##  [3] "\n    \n      Indices of the stock of orders and ranges in manufacturing\n      Reporting period: December 2020\n        Date of issue: 2021.02.18\n      \n  "            
##  [4] "\n    \n      Quarterly labour market statistics\n      Detailed breakdown\n\n      Reporting period: 4th quarter 2020\n        Date of issue: 2021.02.18\n      \n  "     
##  [5] "\n    \n      Insolvencies\n      Reporting period: November 2020\n        Date of issue: 2021.02.19  (deadline)\n      \n  "                                              
##  [6] "\n    \n      Producer price index of industrial products\n      Reporting period: January 2021\n        Date of issue: 2021.02.19\n      \n  "                            
##  [7] "\n    \n      Gross domestic product \n      Detailed breakdown\n\n      Reporting period: 4th quarter 2020\n        Date of issue: 2021.02.24\n      \n  "                
##  [8] "\n    \n      Maastricht deficit ratio\n      Reporting period: Year 2020\n        Date of issue: 2021.02.24\n      \n  "                                                  
##  [9] "\n    \n      Index of orders received\n      provisional release date\n\n      Reporting period: December 2020\n        Date of issue: 2021.02.25  (deadline)\n      \n  "
## [10] "\n    \n      Traffic accident\n      Reporting period: December 2020\n        Date of issue: 2021.02.25  (deadline)\n      \n  "

This output still has some issues that should be fixed, like the abundance of line breaks and whitespace, but we can leave this for later. The fact that we have a character vector of length 10 is already going in the direction that we want.

Now the question is how we can get to the next pages programatically. For this we will inspect the “next” arrow at the bottom of the page.

html attribute of the navigation button

We see that the forward button is an element with the class forward button and has as an href attribute the URL of the next page, however as a relative address (i.e. without the base URL). If you would check on the last possible page of the list, you can see that the (now greyed out) forward button now does not contain an href attribute.

So we can define a loop that starts on the first page, reads the data as well as the URL of the following page from the forward button (using html_attr("href")) and then updates the page from which to read. This loop is repeated (with the data appended to the previously scraped data) until the last page is reached where the aforementioned command will return a missing value. As discussed before, the content of the rows is a little problematic due to line breaks and spaces, but otherwise is quite structured. Using some string operations and regular expressions we can split the information into three columns, containing the name of the indicator, the reference period of the data release, and the release date.

base_url <- "https://www.destatis.de/"

destatis_calendar <- read_html(
  "https://www.destatis.de/SiteGlobals/Forms/Suche/Termine/EN/Terminsuche_Formular.html"
) 

current_page <- destatis_calendar
next_endpoint <- destatis_calendar %>% 
  html_node("li .forward.button") %>% 
  html_attr("href")

entries <- tibble(indicator = character(), 
                  reference_period = character(), 
                  release = character())

repeat{
  current_entries <- current_page %>% 
    html_nodes(".c-result") %>% 
    # I need to get to the second level div
    html_nodes("div div") %>% 
    html_text() %>% trimws() %>%
    as_tibble() %>%
    extract(value, into = c("indicator", "reference_period", "release"),
            regex = "^(.*\\s.*)\\s*Reporting period:(.*)\\s*Date of issue: (\\d{4}.\\d{2}\\.\\d{2})")
  # I need to use the general whitespace character \s as the normal linebreak \n somehow didn`t work.
  
  entries <- union(entries, current_entries)
  
  next_endpoint <- current_page %>% 
    html_node("li .forward.button") %>% 
    html_attr("href")
  
  if(is.na(next_endpoint)) break # break the loop if we are on the last page

  # update page by following the forward button
  current_page <- read_html(paste0(base_url, next_endpoint))
}

rm(current_page, current_entries, next_endpoint)

destatis_releases <- entries %>% 
  mutate(release = as.Date(release, "%Y.%m.%d"),
         indicator = trimws(indicator))

head(destatis_releases)
## # A tibble: 6 x 3
##   indicator                                        reference_period   release   
##   <chr>                                            <chr>              <date>    
## 1 "Persons employed in manufacturing"              " December 2020"   2021-02-16
## 2 "Indices of wholesale trade prices"              " January 2021"    2021-02-17
## 3 "Indices of the stock of orders and ranges in m~ " December 2020"   2021-02-18
## 4 "Quarterly labour market statistics\n      Deta~ " 4th quarter 202~ 2021-02-18
## 5 "Insolvencies"                                   " November 2020"   2021-02-19
## 6 "Producer price index of industrial products"    " January 2021"    2021-02-19

I hope you found this tutorial useful!

Avatar
Alexandros Melemenidis

Make the data work for YOU.