Chapter 6 Extracting Data From The Programmable Web

6.1 The role of Web APIs for automated Web data collection

In the previously introduced web application model powering dynamic/interactive websites, the data integrated into the webpage on the client side is usually exchanged between server and client in standardized formats such as XML and JSON. The concept of web APIs is based on the same idea and specifically aimed at facilitating the integration of web data into various applications/websites over the Internet. Web APIs thus serve as data hubs, providing data to various web applications which might further process the data and finally display it in a graphical user interface (e.g., a webpage). A large part of the explicit exchange of data over the Internet through such applications thus happens ‘programmatically’ and not ‘manually’ by users explicitly requesting data by typing a URL into the browser bar. At a larger scale, this programmatic integration of data in various web applications over the Internet constitutes the ‘programmable web’.16 In this programmable web, we can think of APIs as “[…] the central access points for empirical researchers when they want to systematically collect and analyze data […]” (Matter 2018, 1).
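
As a minimal illustration of this kind of data exchange (the record below is made up for this example and does not come from any actual API), a JSON-formatted response can be read directly into an R data frame:

# a made-up JSON record, as a web API might return it
library(jsonlite)
json_string <- '[{"name": "Jane Doe", "office": "U.S. Senate", "state": "CA"}]'
fromJSON(json_string)  # returns a one-row data.frame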

We have previously explored the web technologies that make the programmable web possible. The economic and social factors that have led to its growth over the last few decades are also important to consider for web mining in a social science context. Today, decision makers in business, government, and non-profit organizations might well consider the data as such (and not a website containing the data) the asset that they want to share with users/clients. Take the US non-profit organization Project Vote Smart as an example. The organization’s goal is to make the political process in the United States more transparent for voters. The World Wide Web is obviously very helpful in this regard. The initial investment as well as the maintenance of a web platform informing voters about political candidates, officials, and their background and actions in office are manageable fixed costs, while the variable costs (per website visitor) are extremely low, implying strong economies of scale. Setting up a website following the conventional model solely based on HTTP/HTML would basically fulfill its purpose here. However, consider what Project Vote Smart’s overall objective is: they want the public to be better informed about politics. Thus, while attracting as many citizens as possible to their own website, which provides all kinds of information about politics for free, surely helps to reach this goal, what is really relevant for their mission is not their own website but getting the information it provides out there. Instead of just providing their data on their own website, they therefore set up a web API, allowing web programmers to use and integrate their data in other websites or applications (e.g., an iPhone app). In addition to this supplier-side perspective, the fact that Project Vote Smart’s data is accessed widely by citizens, both through the organization’s own website and through other applications, sets incentives for political candidates and officials to provide accurate and up-to-date information about themselves on their own, further helping Project Vote Smart’s mission. Another setting in which API-facilitated data exchange is encountered is supply chains involving different companies. These APIs are usually closed to the public, but the underlying logic is the same: data can more easily be exchanged over the web and integrated into different applications.

6.2 API Clients (‘Wrappers’) Instead of Web Scrapers

In cases where we have access to an API that provides data to the website we are interested in, web scraping becomes obsolete. Instead, we write an API client for our research purposes. Thereby, we can recycle the blueprint for simple web scrapers in the sense that the API client consists essentially of the same three basic components: (i) handling the communication with the server (the API), (ii) parsing the response and extracting the data of interest, and (iii) formatting/storing the data for further processing/statistical analysis.
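
A minimal sketch of such a skeleton is shown below; the function names fetch_api(), parse_response(), and store_data() are our own, purely illustrative choices and not part of any package. The concrete example further down fills these components with content.

# (requires the httr and jsonlite packages)
library(httr)
library(jsonlite)

# COMPONENT I: handle the communication with the API
fetch_api <- function(url) {
     resp <- GET(url)
     stopifnot(status_code(resp) == 200)
     content(resp, as = "text")
}
# COMPONENT II: parse the response, extract the data of interest
parse_response <- function(json_doc) {
     fromJSON(json_doc)
}
# COMPONENT III: format/store the data for further analysis
store_data <- function(df, path) {
     write.csv(df, file = path, row.names = FALSE)
}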

What changes is that the second component is much easier to implement than in many scrapers, because the request sent in the first component (usually encoded in a URL) already guarantees that we get exactly the data we want. Before, we had to inspect the source code of a webpage in order to know how to filter the HTML document for the part of the data we are interested in. Now, all we have to do is read the API documentation, which is freely available online and tells us how to construct the URLs we send to the API to request specific data. The URLs used in an exchange with a server providing such a web API are thus no longer pointers to a webpage but rather queries for clearly defined data entities.

The following example, in which we query data from the open and freely accessible Ergast Developer API providing data on motor race results, illustrates this point. The API’s documentation (see http://ergast.com/mrd/) tells us step by step how to construct a query in the form of a URL.

# PREAMBLE --------

# load packages
library(rvest)
library(httr)
library(jsonlite)

# fix variables
BASE_URL <- "http://ergast.com/api"
SERIES <- "f1"

# COMPONENT I: URL CONSTRUCTION, HTTP REQUEST ----
# set variables for query
season <- "2008"
round <- "2"
option <- "drivers"
format <- "json"

# construct URL
query <- paste(BASE_URL, SERIES, season, round, option, sep = "/" )
URL <- paste(query, format, sep = ".")

# HTTP request, handle response
api_resp <- GET(URL)
if (api_resp$status_code==200) {
     json_doc <- httr::content(api_resp, as = "text")
} else {
     stop("HTTP response not OK!")
}


# COMPONENT II: PARSE/EXTRACT DATA ----
# parse raw JSON data
api_data <- fromJSON(json_doc)
# extract table with data on drivers (as data.frame)
drivers_table <- api_data$MRData$DriverTable$Drivers
# COMPONENT III: FORMAT/STORE DATA ----
# write to csv file
write.csv(drivers_table, file = "data/ergast_drivers.csv", row.names = FALSE)
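
The same query logic can also be wrapped in a small helper function, which makes it easy to fetch, for example, the driver tables of several rounds. The following is only a sketch: the function name get_drivers() is our own choice, and the extraction path in the last line is specific to the 'drivers' option (other options would require a different path).

# sketch of a reusable query function (function name is our own choice)
get_drivers <- function(season, round,
                        base_url = BASE_URL, series = SERIES) {
     query <- paste(base_url, series, season, round, "drivers", sep = "/")
     URL <- paste(query, "json", sep = ".")
     api_resp <- GET(URL)
     if (api_resp$status_code != 200) {
          stop("HTTP response not OK!")
     }
     api_data <- fromJSON(httr::content(api_resp, as = "text"))
     # extraction path specific to the 'drivers' option
     api_data$MRData$DriverTable$Drivers
}

# example usage: drivers of the first two rounds of the 2008 season
drivers_round1 <- get_drivers("2008", "1")
drivers_round2 <- get_drivers("2008", "2")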

6.3 Mining Social Media

Most social science studies relying on data collected via web APIs are within the realm of social media (although web APIs are used for many other websites as well). The reasons for this are twofold: on the one hand, the social and economic processes and connections observable in social media are very interesting for social scientists to study. On the other hand, social media web applications lend themselves technically very well to an implementation based on APIs and profit greatly from the interchangeability of data between different applications. Requesting data from and submitting data to Facebook or Twitter is naturally done via various devices (even by the same users). While the Facebook iPhone app is clearly a different piece of software than the web browser in which we visit our Facebook account on a laptop, the data runs through the same API in the background. The ability to integrate data across different applications has, in fact, become crucial for Twitter, Facebook, and the like. Facebook like-buttons or log-ins to other web services via a Facebook account are frequently encountered examples of this development. Similar to the introductory example of Project Vote Smart, the way the social media platform works sets incentives for users to contribute data. Another example is the embedding of a Twitter feed in webpages, which also runs on an API in the background (such as the Twitter feed embedded in HSG’s Alexandria research platform, see Figure 6.1).

Figure 6.1: Example of a Twitter-feed embedded in a webpage. Source: https://www.alexandria.unisg.ch/.

Due to the broad interest in accessing and using these APIs for research purposes, there are several ready-made API clients (implemented as R packages) for many of the largest social media platforms such as Facebook, Twitter, Instagram, LinkedIn, and Tumblr.

6.4 Twitter Mining Introduction

Twitter offers several APIs for web developers, many features of which are accessible with ready-made R packages. In this example, we use twitteR to access Twitter’s REST API. Before we get started with the data collection from Twitter, we have to obtain the credentials to access their REST API.17

# PREAMBLE----

# install the Twitter API client (if not installed yet):
# install.packages("twitteR")

# load packages
library(twitteR)

# set access variables 
# (copy paste from your registered app on https://apps.twitter.com/app ),
# replacing the xxx
api_key <- 'xxx'
api_secret <- 'xxx'
access_token <- 'xxx'
access_secret <- 'xxx'

Once we have all the necessary credentials to access the API via R, we can use the high-level function setup_twitter_oauth() provided in the twitteR package in order to authenticate and set up a session with the API. If all credentials are correct, we can then directly start querying the API for data on tweets.

# set up the authentication for R to use the API with your credentials
setup_twitter_oauth(api_key, api_secret, access_token, access_secret)
## [1] "Using direct authentication"
# HARVEST TWEETS ----
# search Tweets mentioning a key word (here St. Gallen)
some_tweets <- searchTwitter('St. Gallen', n=100, lang='en')
# have a look at the search results
head(some_tweets,3)
## [[1]]
## [1] "GallenSchool: Well done to Gallen Community School Minor Camogie team who defeated St. Fergal's College Rathdowney yesterday in t… https://t.co/gt3VcR1WAC"
## 
## [[2]]
## [1] "rumor76358: RT @MySwitzerland_e: St. Gallen @sgbtourismus is not only a business location for young start-ups, but above all also a green city for youn…"
## 
## [[3]]
## [1] "PanoScout: RT @PanoScout: 🕵️‍♂️ A scouting report of the young star of FC St. Gallen, Julian von Moos. A player who’s got great stats this season, 8 G…"

The twitteR package provides several high-level functions to explore Twitter and get started with a Twitter mining task (see the documentation of the package for details):

  • searchTwitter(): Searches Twitter based on a search string. The search can be further specified by a set of function arguments: n, the maximum number of tweets to return (note that a large n can take quite a while to process); lang, which restricts the tweets to a specific language; since/until, which define the time range of the search; and geocode, which filters for tweets issued within a given radius of a given latitude/longitude (see the sketch after this list).
  • getUser(): Retrieves information about a specific Twitter user.
  • userTimeline(): Retrieves the timeline of a specific Twitter user.
  • retweets()/retweeters(): These functions return data on the retweets of a specified tweet and on the users retweeting it, respectively.
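
As a sketch of how these arguments can be combined (the keyword, dates, and coordinates below are made up for illustration; the coordinates roughly point to the city of Zurich):

# sketch: a more narrowly specified search (all values purely illustrative)
zurich_tweets <- searchTwitter("zurich",
                               n = 50,                    # at most 50 tweets
                               lang = "en",               # only English tweets
                               since = "2017-10-01",      # start of time range
                               until = "2017-10-31",      # end of time range
                               geocode = "47.376887,8.541694,10mi") # within 10 miles of Zurich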

In addition, the package provides several helpful functions to clean, format, and store the retrieved Twitter data:

  • twListToDF(): coerces a list of objects of a twitteR class into a data.frame.
  • strip_retweets(): removes retweets from a list of status objects (i.e., the search results returned by searchTwitter()); see the short example after this list.
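
Applied to the search results from above, a typical sequence looks as follows (a short sketch reusing the some_tweets object created earlier):

# remove retweets from the search results and coerce the rest to a data.frame
some_tweets_clean <- strip_retweets(some_tweets)
some_tweets_df <- twListToDF(some_tweets_clean)
head(some_tweets_df$text)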

6.4.1 “Spy Game”: A simple twitter mining tutorial

Following the blueprint of an R script for simple web data mining tasks, this exercise documents how the twitteR package can be applied to explore simple descriptive research questions. The exercise serves both as an example of how the basic functionality of the package can be used to collect data for a social science research project and as an illustration of why social media (and Twitter in particular) open new opportunities for researchers in the social sciences to empirically investigate questions that are very hard to tackle without these new data sources.

In practice, it would be rather difficult to figure out what spies at the National Security Agency (NSA) have on their minds by surveying/interviewing them on a daily basis. Twitter might be an alternative data source here. Imagine some NSA agents tweet about what’s on their minds when walking to their car after a long day of work (see the image shown in Figure 6.2). By means of Twitter’s API we could systematically collect these tweets, at least if they are geo-coded. Since the compound around the NSA’s headquarters at Fort Meade is likely almost exclusively accessible to persons who work for the NSA, it is reasonable to assume that geo-located tweets issued from this area are either written by NSA employees or are official tweets issued by the NSA. We thus write a short script using the twitteR package in order to collect such tweets (note that these tweets are obviously publicly accessible and could equally well be looked at on twitter.com).

Figure 6.2: Headquarters of the NSA at Fort Meade, Maryland. Source: Wikimedia Commons (public domain).

We start by setting up the header of the script and entering our API credentials. Note the GEOCODE variable to which we assign the coordinates of Fort Meade (as per Google Maps).

########################################
# Introduction to Web Mining 2017
# 6: Programmable Web, Social Media
#
# Twitter Mining Exercise: Spy Game
# U.Matter, October 2017
########################################

# PREAMBLE ----------------------

# load packages
library(twitteR)
library(stringr)
library(tm)
library(ggplot2)
# Set fix variables
# set access variables
# (copy/paste from your registered app on https://apps.twitter.com/app ),
# replacing the xxx
api_key <- 'xxx'
api_secret <- 'xxx'
access_token <- 'xxx'
access_secret <- 'xxx'
# set query-specific variables
# where to search for tweets (NSA headquarters)
longlat_nsa <- "39.107792,-76.745032"
#longlat_cia <- "38.952260,-77.144899"
radius <- "1.5mi"
GEOCODE <- paste(longlat_nsa, radius, sep = "," )
KEYWORD <- " "

In the first component of the script, we initiate the API session and then use searchTwitter() to look for tweets issued in the defined geographical region. For the moment, we simply collect all tweets issued in this region (this could be further refined by using specific keywords, see KEYWORD above). In a second step, we remove retweets from the results. The reason for this is that a large part of the results (likely depending on the day) is related to the NSA’s official tweets, which are retweeted rather often compared to other tweets issued in the same area.

# COMPONENT I) Authenticate, start Twitter session, query data

# set up the authentication for R to use the API with your credentials
setup_twitter_oauth(api_key, api_secret, access_token, access_secret)
## [1] "Using direct authentication"
# collect geo-coded tweets and remove retweets
spy_tweets <- searchTwitter(searchString = KEYWORD, geocode = GEOCODE, n = 500)
## Warning in doRppAPICall("search/tweets", n, params =
## params, retryOnRateLimit = retryOnRateLimit, : 500
## tweets were requested but the API can only return 136
spy_tweets <- strip_retweets(spy_tweets)

In the second component we extract and clean the data returned from the API: first, we use the twListToDF() function to extract the returned tweets in the form of a data frame. Then, we ‘clean’ the actual texts of the tweets by removing all non-graphic characters with str_replace_all() and a regular expression ([^[:graph:]]).18

# COMPONENT II) Extract and clean data ----
# coerce search results to data.frame
spy_tweets_df  <- twListToDF(spy_tweets)
# remove all non-graphic characters
spy_tweets_df$text<- str_replace_all(spy_tweets_df$text,"[^[:graph:]]", " ")

Then, we specifically extract only the part of the tweet texts that we want to process/analyze further. For this, we make use of the text mining package tm. In a first step, we build a text corpus from the extracted tweet texts (Corpus(VectorSource(spy_tweets_df$text))). We then use the function tm_map() to reformat and further clean the extracted texts. These steps are rather standard at the beginning of a text analysis exercise. The overall goal is usually to remove redundancies and noise from the data (e.g., removing stopwords such as ‘a’, ‘and’, ‘but’).

# Text Mining Part
# build a text corpus, and specify the source to be character vectors
spy_corpus <- Corpus(VectorSource(spy_tweets_df$text))
# convert text to lower case
spy_corpus <- tm_map(spy_corpus, content_transformer(tolower))
# remove all punctuation
spy_corpus <- tm_map(spy_corpus, removePunctuation)
# remove numbers
spy_corpus <- tm_map(spy_corpus, removeNumbers)
# remove stopwords
spy_corpus <- tm_map(spy_corpus, removeWords, stopwords("english"))
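
Depending on the raw tweets, further cleaning steps might be added at this point. For example (a sketch that is not part of the original script), links contained in tweets can be removed with a custom transformation before building the term-document matrix:

# sketch (optional): additionally remove URLs from the tweet texts
remove_urls <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
spy_corpus <- tm_map(spy_corpus, content_transformer(remove_urls))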

In the last component of the script, we analyze the data with a few basic text mining techniques. The function TermDocumentMatrix() returns a term-document matrix in which the cell values contain the number of times a term \(i\) occurs in document (here: tweet) \(j\). We can then have a look at which terms are particularly frequent with findFreqTerms() (here with lowfreq=5, returning only terms that occur at least five times).

# COMPONENT III) analyze data ----

# Text Mining
# generate term-document-matrix
spy_tdm <- TermDocumentMatrix(spy_corpus)
# what are the frequent words?
findFreqTerms(spy_tdm, lowfreq=5)[1:5]
## [1] "work"  "one"   "’s"    "’m"    "great"

Finally, we aggregate the results by summing up the term frequencies for individual terms over all tweets (term.freq <- rowSums(as.matrix(spy_tdm))). We then select the subset of terms that occur at least three times and store it in a data frame in which the rows are ordered according to the frequency of the terms.

# compute word-frequency
term.freq <- rowSums(as.matrix(spy_tdm))
term.freq <- subset(term.freq, term.freq >=3)
# remove remaining special characters
term.freq <- term.freq[!grepl(pattern = "[^[:alnum:]]", x = names(term.freq))]
#arrange in data-frame
spy_term_df <- data.frame(term = names(term.freq), freq = term.freq)
spy_term_df <- spy_term_df[order(spy_term_df$freq),]

The resulting word counts are then visualized with functions provided in the ggplot2 package.

# plot word-frequency
ggplot(spy_term_df, aes(x=term, y=freq)) +
     geom_bar(stat = "identity", fill = "steelblue") +
     xlab("Term")+
     ylab("Frequency") +
     theme_light() +
      #theme(text= element_text(family="Arial", size=12)) +
     coord_flip()
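
By default, ggplot2 orders the terms on the axis alphabetically. As a small refinement not shown in the original script, the term column can be turned into a factor whose levels follow the row order of spy_term_df (which is already sorted by frequency), so that re-running the plotting code above shows the bars sorted by frequency:

# sketch: fix the factor levels so that bars are ordered by frequency
spy_term_df$term <- factor(spy_term_df$term,
                           levels = as.character(spy_term_df$term))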

6.5 Opportunities and Challenges

The technical opportunities of the programmable web as a data source for social science research become most apparent when we compare the data collection procedure with data collection from a website via web scraping. If an API is available and accessible, collecting data from it is usually straightforward, as the documentation provided by the API provider describes exactly what data can be accessed with which API method and in what format the data will be provided. If the same data is embedded in a website, we have to figure out ourselves how to access and extract it from the source code.

As web APIs are often encountered in contexts where the web application essentially relies on the exchange of data with, and the contribution of data by, many users (such as in the social media context), APIs also offer interesting opportunities from a social perspective.

Key aspects of such opportunities are:

  • The data generation happens independently of the specific research question at hand (no framing issues as in surveys).
  • Connections between individuals and shared interests are easy to detect (social network analysis).
  • Real-time capturing/time-stamping of individuals’ actions/communications.
  • Straightforward detection of individuals’ location (if the message/status is geo-coded).

There are, of course, also some important limitations when it comes to data collected from web APIs (particularly in the realm of social media). In the context of many platforms that provide and generate data via APIs, we have to take into consideration that the data set we are compiling to address our research question likely suffers from selection bias. In the case of Twitter, for example, we can only observe those individuals actually using Twitter, and the fact that these individuals choose to use the platform is not random. In some cases, such as analyses based on geo-coded tweets, the selection is even more specific (because not all Twitter users agree to have their tweets geo-coded). We thus have to be very careful when generalizing and extrapolating any findings based on Twitter. Another potential reason for concern is that even if we are only interested in the specific sample of people with a Twitter account, what those people communicate on Twitter might not reflect what they do outside social media. With that in mind, the most interesting questions to be investigated with social media data often take the very nature of the social media platform at hand into account (and do not regard it as a one-to-one substitute for survey data or other observational data). Thus, if we are explicitly interested in Twitter users and their tweeting behavior, Twitter is obviously a good source. If we want to learn something about society at large, Twitter might not be the best source.

References

Matter, Ulrich. 2018. “RWebData: A High-Level Interface to the Programmable Web.” Journal of Open Research Software 6 (1): 1–12. https://doi.org/10.5334/jors.201.
Swartz, Aaron. 2013. “Aaron Swartz’s a Programmable Web: An Unfinished Work.” In Synthesis Lectures on the Semantic Web: Theory and Technology, edited by James Hendler and Ying Ding. Morgan & Claypool Publishers.

  16. The term programmable web is used here as motivated in Swartz (2013).

  17. In order to obtain these credentials, one needs to be registered as a Twitter user and then register to get access to the API: https://apps.twitter.com/.

  18. See ?str_replace_all and https://en.wikipedia.org/wiki/Regular_expression for details.