Chapter 7 Advanced Web Scraping: Data Extraction from Dynamic Websites
7.1 What if there is no API?
The programmable web offers new opportunities for web developers to integrate and share data across different applications over the web. In recent chapters we have learned about some of the key technological aspects of this programmable web and dynamic websites:
- Web Application Programming Interfaces (web APIs): A predefined set of HTTP requests/responses for querying data hosted on the server (or providing data from the client side).
- Extensible Markup Language (XML) and JavaScript Object Notation (JSON): Standards for formatting data so that they are both human- and machine-readable, which facilitates the exchange of data between different systems/applications over the web.
- JavaScript/AJAX: A programming language and framework designed to build interactive/dynamic websites. In the AJAX framework, a JavaScript program built into an HTML document/website could, for example, be triggered by the user clicking on a button. This program might then automatically request additional data (in XML format) from the server via an API and embed the new data dynamically in the HTML document on the client side.
In the context of web mining, these new technologies mean that automated data collection from the web can become either substantially easier or substantially more difficult than automated data collection from the ‘old’ web. Which of the two is the case essentially depends on whether a dynamic website relies on an API, and if so, whether this API is publicly accessible (and hopefully free of charge). If the latter is the case, our web mining task is reduced to (a) understanding how the specific API works (which is usually very easy, since open APIs tend to come with detailed and user-friendly documentation) and (b) knowing how to extract the data of interest from the returned XML or JSON documents (which is usually substantially easier than scraping data from HTML pages). In addition, there might already be a so-called ‘API client’ or ‘wrapper’ implemented in an R package that does all this for us, such as the twitteR package to collect data from one of Twitter’s APIs. If such a package is available, we only have to learn how to apply it to systematically collect data from the API (as shown in the previous chapter).
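To make steps (a) and (b) concrete, the following is a minimal, hypothetical sketch (the API endpoint and its query parameters are made up for illustration): we send a GET request with httr and parse the returned JSON with jsonlite.
# hedged sketch: querying a hypothetical open API and parsing the JSON response
# (the endpoint URL and query parameters below are placeholders, not a real API)
library(httr)
library(jsonlite)

api_url <- "https://api.example.com/v1/observations"   # placeholder endpoint
resp <- GET(api_url, query = list(country = "CH", year = 2020))

# parse the JSON body into an R object (typically a data frame or list)
if (http_type(resp) == "application/json") {
  dat <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}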
In cases where no API is available, the task of automated data collection from a dynamic website/web application can become much more complex because not all of the data is provided when we issue a simple GET request based on a URL pointing to a specific webpage. This manuscript covers some of the frequently encountered aspects of and difficulties in web mining related to such dynamic sites. The important take-away message is, however, that, unlike with the ‘old’ web, there is not one simple generic approach on which we can build when writing a scraper. The techniques necessary to scrape dynamic websites are much more case-specific and might even require substantial knowledge of JavaScript and other web technologies. Covering all the techniques for automated data collection from dynamic websites would thus go far beyond the scope of this course. However, there is also an alternative approach to dealing with such websites that can be employed rather generally: instead of writing a program that ‘decomposes’ the website to scrape the data from it, we can rely on a framework that allows us to programmatically control an actual web browser and thus simulate a human using a browser (including scrolling, clicking, etc.). That is, we use so-called ‘web browser automation’ instead of simple scrapers.
7.2 Scraping dynamic websites
The first step of dealing with a dynamic website in a web mining task is to figure out where the data we see in the browser is actually coming from. This is where the ‘Developer Tools’ provided with modern browsers such as Chrome or Firefox come into play. The ‘Network’ panel in combination with the source-code inspector helps to evaluate which web technologies are used to make the dynamic website work. From there, we can investigate how the data can be accessed programmatically. For example, we might detect that a JavaScript program embedded in the webpage queries additional data from the server whenever we scroll down in the browser and that all this additional data is transmitted in XML (before being embedded in the HTML). We can then figure out how exactly the data is queried from the server (e.g., how to build a query URL) in order to automate the data extraction directly. The question then becomes how we can implement all this in R. In short, the following three questions can get us started:
- Which web technologies are used?
- Given a set of web technologies, how can we theoretically access the data?
- How can we practically collect the data with R?
This section gives insights into some of the web technologies frequently encountered when scraping data from dynamic websites and how to deal with them in R. As pointed out above, this is not a complete treatment of all techniques relevant to scraping data from dynamic websites. Thus, the techniques discussed here might not be relevant or sufficient in other cases.
7.2.2 AJAX and XHR
AJAX (Asynchronous JavaScript And XML) is a set of web technologies often employed to design dynamic webpages. The main purpose of AJAX is to allow the asynchronous (meaning ‘under the hood’) exchange of data between client and server when a webpage is already loaded. This means parts of a webpage can be changed/updated without actually reloading the entire page (as illustrated in Figure 7.2).
What w3schools.com calls “a developer’s dream” is a web scraper’s nightmare. The content of a webpage designed with AJAX cannot be downloaded by simply requesting an HTML document with a URL. Additional data will be embedded in the page as the user scrolls through it in the browser. Thus, what we see in the browser is not what we get when simply requesting the same webpage with httr. In order to access these additional bits of data automatically via R, we have to mimic the specific HTTP transactions between the browser and the server that are related to the loading of additional data. These transactions (as illustrated in Figure ??) are usually implemented with a so-called XMLHttpRequest (XHR) object. If we want to control the exchange between client and server on a dynamic website based on AJAX, figuring out how XHR is used on this website is thus a good starting point.
The following code example illustrates how the control of XHR via R can be implemented in the case of bloomberg.com. The goal is to scrape the ticker information shown at the top of the website’s homepage.
########################################
# Introduction to Web Mining
# Bloomberg Ticker
########################################
# SETUP -------
# load packages
library(httr)
library(xml2)
library(rvest)
# 'TRADITIONAL' APPROACH -----
# fetch the webpage
<- "https://www.bloomberg.com/europe"
URL <- GET(URL)
http_resp # parse HTML
<- read_html(http_resp)
html_doc # extract the respective section according to the XPath expression
# found by inspecting the page in the broswer with Developer Tools
<- './/div[@class="ticker-bar"]'
xpath <- html_nodes(html_doc, xpath = xpath)
ticker_nodes ticker_nodes
## {xml_nodeset (0)}
This approach does not seem to be successful. We don’t get what we see in the browser. When tracing back the origin of the problem, it becomes apparent that some of the HTML body displayed in the browser is missing from the response. When we use the exact same XPath expression in the browser’s Web Developer Tools console, we do get the expected result.
By inspecting the network traffic with the Developer Tools’ Network panel, we notice traffic related to XHR. When having a closer look at these entries (via the ‘Response’ panel), we identify a GET request with a URL pointing to https://www.bloomberg.com/markets2/api/tickerbar/global, which returns exactly the data we were looking for in the webpage. A simple way to scrape the data based on this information seems to be to copy/paste this URL and rewrite the code chunk above accordingly. When simply pasting https://www.bloomberg.com/markets2/api/tickerbar/global into the browser’s address bar, we do indeed see the entire ticker data, conveniently provided as a JSON file.
# mimic XHR GET request implemented in the bloomberg.com website
<- "https://www.bloomberg.com/markets2/api/tickerbar/global"
URL <- GET(URL)
http_resp # inspect the response
http_resp
## Response [https://www.bloomberg.com/tosv2.html?vid=&uuid=abc67731-b47d-11ed-a3f8-6e794d794447&url=L21hcmtldHMyL2FwaS90aWNrZXJiYXIvZ2xvYmFs]
## Date: 2023-02-24 19:58
## Status: 200
## Content-Type: text/html
## Size: 11.7 kB
## <!doctype html>
## <html lang="en">
## <head>
## <title>Bloomberg - Are you a robot?</title>
## <meta name="viewport" content="width=device-widt...
## <meta name="robots" content="noindex">
## <style rel="stylesheet">
## @font-face {
## font-family: BWHaasGroteskWeb;
## font-display: swap;
## ...
Unlike when issuing the seemingly identical GET request from within the browser, we do not get the expected JSON file. Instead, bloomberg.com is asking us whether we are a robot. This illustrates that in modern websites that are partially based on web technologies like AJAX, the interaction between the browser and the server involves many more aspects than simple GET requests with a URL. Under the hood, bloomberg.com is also verifying whether the request for the XHR object has actually been issued from an authentic browser. Now, httr and related packages offer several ways to mimic these exchanges with the server in order to appear more like an authentic browser. However, at this point it might not be worthwhile to invest so much time in developing such a sophisticated scraper, given that there is a reasonable alternative: browser automation.
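To give an idea of what such an attempt could look like, the following sketch adds a browser-like User-Agent and typical request headers to the GET request. The header values are illustrative, and bloomberg.com may well still block the request, which is precisely why browser automation is the more robust route here.
# hedged sketch: mimic a browser more closely by setting typical request headers
# (header values are illustrative; the server may still detect and block the request)
library(httr)
library(jsonlite)

URL <- "https://www.bloomberg.com/markets2/api/tickerbar/global"
http_resp <- GET(URL,
                 user_agent("Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/110.0"),
                 add_headers(Accept = "application/json",
                             Referer = "https://www.bloomberg.com/europe"))

# only attempt to parse if the server actually returned JSON
if (http_type(http_resp) == "application/json") {
  ticker_data <- fromJSON(content(http_resp, as = "text", encoding = "UTF-8"))
}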
7.3 Browser Automation with RSelenium
As an alternative to analyzing and exploiting the underlying mechanisms that control the exchange and embedding of data in a dynamic website, browser automation tackles web mining tasks at ‘a higher level’. Browser automation frameworks allow us to programmatically control a web browser and thereby simulate a user browsing webpages. While most browser automation tools were originally developed for web developers to test the functioning of new web applications (by simulating many different user behaviors), they are naturally also helpful for automated data extraction from web sources, particularly if the content of a website is generated dynamically.
A widely used web automation framework is Selenium. The R package RSelenium
(Harrison 2022) is built on top of this framework, which means we can run and control browser automation via Selenium directly from within R. The following code example gives a brief introduction to the basics of using RSelenium to scrape dynamic webpages.23
7.4 Installation and Setup
For basic usage of Selenium via R, the package RSelenium is all that is needed to get started. When running install.packages("RSelenium"), the necessary dependencies will usually be installed automatically. However, in some cases you might have to install Selenium and/or additional dependencies manually in order to make RSelenium work. In particular, it might be necessary to install Java manually (see here for detailed instructions). Depending on the operating system, the steps needed to install Selenium locally can differ. It might therefore be easier to run Selenium in a Docker container. See the instructions on how to run Selenium/RSelenium via Docker on Windows or Linux here and on Mac OSX here. In the code example below, Selenium is run in a Docker container. Also, see the official package vignette to get started with RSelenium.
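If you go the Docker route, preparing the Selenium image from within R could look as follows (a minimal sketch, assuming Docker is installed and the Docker daemon is running; the same commands can also be run directly in a terminal).
# hedged sketch: prepare the Selenium image (assumes Docker is installed and running)
# pull the standalone Firefox image used in the example below
system("docker pull selenium/standalone-firefox")
# verify that the image is now available locally
system("docker images selenium/standalone-firefox")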
Running RSelenium on your computer means running both a Selenium server and a Selenium client locally. The server runs the automated browser and the client (here R) tells it what to do. Whenever we use RSelenium, we thus first have to start the Selenium server (here by starting the Docker container that runs the Selenium server). Once the server is running, we can initiate the client, connect to the Selenium server with remoteDriver(), and initiate a new browser session with myclient$open(). We then control our robot browser through myclient.
# load RSelenium
library(RSelenium)
# start the Selenium server (in docker)
system("docker run -d -p 4445:4444 selenium/standalone-firefox")
# initiate the Selenium session
myclient <- remoteDriver(remoteServerAddr = "localhost",
                         port = 4445L,
                         browserName = "firefox")
# start browser session
myclient$open()
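To verify that the client is actually talking to the Selenium server, we can query the server status (a minimal sketch; the exact fields in the returned list depend on the Selenium version).
# query the Selenium server status to verify the connection
# (the exact fields in the returned list depend on the Selenium version)
myclient$getStatus()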
7.4.1 First Steps with RSelenium
All methods (functions associated with an R object) can be called directly from myclient. This includes all kinds of instructions to guide the automated browser as well as methods for accessing the content of the page that is currently open in the automated browser. For example, we can navigate the browser to open a specific webpage with navigate() and then extract the title of this page (i.e., the text between the <title> tags) with getTitle().
# start browsing
$navigate("https://www.r-project.org")
myclient$getTitle() myclient
## [[1]]
## [1] "R: The R Project for Statistical Computing"
Navigating the automated browser in this manner closely mirrors how we navigate a browser through the usual graphical user interface. Thus, if we want to visit a number of pages, we tell it step by step to navigate from page to page, including going back to a previously visited page (with goBack()).
# simulate a user browsing the web
$navigate("http://www.google.com/ncr")
myclient$navigate("http://www.bbc.co.uk")
myclient$getCurrentUrl() myclient
## [[1]]
## [1] "https://www.bbc.co.uk/"
myclient$goBack()
myclient$getCurrentUrl()
## [[1]]
## [1] "https://www.google.com/?gws_rd=ssl"
Once a webpage is loaded, specific elements of it can be extracted by means of XPath or CSS selectors. With RSelenium, however, the ability to access specific parts of a webpage is not only used to extract data but also to control the dynamic features of a webpage, for example, to automatically operate Google’s search function and extract the respective search results. Such a task is typically rather difficult to implement with more traditional web mining techniques because the webpage presenting the search results is generated dynamically and there is thus no unique URL under which the page is constantly available. In addition, we would have to figure out how the search queries are actually sent to a Google server.
With RSelenium
we can navigate to the search bar of Google’s homepage (here by selecting the input tag with XPath), type in a search term, and hit enter to trigger the search.
# visit google
$navigate("https://www.google.com")
myclient# navigate to the search form
<- myclient$findElement('xpath', "//input[@name='q']")
webElem # type something into the search bar
$sendKeysToElement(list("R Cran"))
webElem# type a search term and hit enter
$sendKeysToElement(list("R Cran", key = "enter")) webElem
By default, Google opens the newly generated webpage presenting the search results in the same browser window. Thus, the results page is now automatically available in myclient and we can access its source code with the getPageSource() method. To process the source code, we do not have to rely on RSelenium’s internal methods and functions but can also use the already familiar tools in rvest.
# scrape the results
# parse the entire page and take it from there...
# for example, extract all the links
html_doc <- read_html(myclient$getPageSource()[[1]])
link_nodes <- html_nodes(html_doc, xpath = "//a")
html_text(html_nodes(link_nodes, xpath = "@href"))[2]
## [1] "https://mail.google.com/mail/&ogbl"
This approach might actually be quite efficient compared to RSelenium’s internal methods. However, we can also use those methods to achieve practically the same result.24
# or extract specific elements via RSelenium
# for example, extract all the links
<- myclient$findElements("xpath", "//a")
links unlist(sapply(links, function(x){x$getElementAttribute("href")}))[2]
At the end of a data mining task with RSelenium, we close the browser session and stop the Selenium server as follows.
# be a good citizen and close the session
myclient$close()
# stop the docker container running selenium
# system("docker stop $(docker ps -q)")
In practice, RSelenium
can be very helpful when extracting data from dynamic websites as the procedure guarantees that we get exactly what we would get by using a browser manually to extract the data. We thus do not need to worry about cookies, AJAX, XHR, and the like, as long as the browser we are automating with Selenium deals with these technologies appropriately. On the downside, scraping webpages with RSelenium
is usually less efficient and slower than a more direct approach with httr
/rvest
.25
Given the example code above, RSelenium can be seamlessly integrated into the generic web-scraper blueprint used in previous chapters: we simply implement the first component (interaction with the web server, parsing of HTML) with RSelenium and the rest of the scraper with rvest et al.
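As a hedged illustration of this integration (the URL and XPath expression below are placeholders), such a combined scraper could look like this: RSelenium handles the interaction with the web server and the rendering of the dynamic content, while rvest handles parsing and extraction.
# hedged sketch: combining RSelenium and rvest in one scraper function
# (URL and XPath expression are placeholders)
library(RSelenium)
library(rvest)

scrape_dynamic_page <- function(client, url, xpath) {
  # 1) interaction with the web server: let the automated browser render the page
  client$navigate(url)
  Sys.sleep(2)  # crude wait to give dynamic content time to load
  # 2) parsing: hand the rendered source over to rvest
  html_doc <- read_html(client$getPageSource()[[1]])
  # 3) extraction: use the familiar XPath-based tools
  html_text(html_nodes(html_doc, xpath = xpath))
}

# usage (placeholder values)
# scrape_dynamic_page(myclient, "https://www.example.com", "//h1")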
References
This example is based on a similar code example in Munzert et al. (2014, 248). The original example code is based on other R packages.
The base URL for search queries is http://www.biblio.com/search.php? with some search parameters and values (e.g., keyisbn=economics).
In this example, the form button to submit the data has no name attribute. By default, submit_form() expects the button to have a name attribute, and set_values() would by default call the button unnamed (which submit_form does not understand). Hence, names(form_i$fields)[4] <- "" sets the name of the button to an empty string (""), and we tell submit_form() that the button we want to use has no (an empty) name (submit = "").
There are several other ways of achieving the same in R (by rather ‘manually’ setting cookies). However, the functionality provided in the rvest package, as shown, is more user-friendly.
The code example is partly based on the RSelenium vignette on CRAN. For a detailed introduction and instructions on how to set up Selenium and RSelenium on your machine, see the RSelenium vignette on CRAN.
Note that RSelenium and rvest rely on different XPath engines, meaning that an XPath expression might work in the functions of one package but not in the other.
Note that scraping tasks based on Selenium can be sped up by using several clients in parallel. However, the argument of computational efficiency still holds.