Chapter 8 Data Sources, Data Gathering, Data Import
In this chapter, we put the key concepts learned so far (text files for data storage, parsers, encoding, data structures) together and apply them to master the first key bottleneck in the data pipeline: how to import raw data from various sources and export/store them for further processing in the pipeline.
8.1 Sources/formats
In the previous chapters, we learned how data is stored in text files and how different data structures/formats/syntaxes help to organize the data in these files. Along the way, we have encountered key data formats that are used in various settings to store and transfer data:
- CSV (typical for rectangular/table-like data)
- Variants of CSV (tab-delimited, fixed length, etc.)
- XML and JSON (useful for complex/high-dimensional data sets)
- HTML (a markup language to define the structure and layout of webpages)
- Unstructured text
Depending on the data source, data might come in one or the other form. With the increasing importance of the Internet as a data source for economic research, properly handling XML, JSON, and HTML is becoming more important. However, in applied economic research, various other formats can be encountered:
- Excel spreadsheets (.xls, .xlsx)
- Formats specific to statistical software packages (SPSS: .sav, STATA: .dta, etc.)
- Built-in R datasets
- Binary formats
While we will cover/revisit how to import all of these formats here, it is important to keep in mind that the fundamental concepts learned so far are as important (or even more important) than knowing which function to call in R for each of these cases. New formats might evolve and become more relevant in the future, for which no R function exists yet. However, the underlying logic of how such formats structure data will hardly change.
8.2 Data gathering procedure
Before we set out to gather/import data from diverse sources, we should start organizing the procedure in an R script. This script will be the beginning of our pipeline!
First, open a new R script in RStudio and save it as import_data.R in your code folder. Take your time to meaningfully describe what the script is all about in the first few lines of the file:
#######################################################################
# Data Handling Course: Example Script for Data Gathering and Import
#
# Imports data from ...
# Input: links to data sources (data comes in ... format)
# Output: cleaned data as CSV
#
# U. Matter, St.Gallen, 2019
#######################################################################
RStudio recognizes different sections in a script, whereby section headers are indicated by ----------. This further helps to organize the script into different tasks. Usually, it makes sense to start with a 'meta' section in which all necessary packages are loaded and fixed variables are initiated.
#######################################################################
# Data Handling Course: Example Script for Data Gathering and Import
#
# Imports data from ...
# Input: links to data sources (data comes in ... format)
# Output: cleaned data as CSV
#
# U. Matter, St.Gallen, 2019
#######################################################################
# SET UP --------------
# load packages
library(tidyr)
# set fixed variables
INPUT_PATH <- "/rawdata"
OUTPUT_FILE <- "/final_data/datafile.csv"
Finally, we add sections with the actual code (in the case of a data import script, maybe one section per data source).
#######################################################################
# Project XY: Data Gathering and Import
#
# This script is the first part of the data pipeline of project XY.
# It imports data from ...
# Input: links to data sources (data comes in ... format)
# Output: cleaned data as CSV
#
# U. Matter, St.Gallen, 2019
#######################################################################
# SET UP --------------
# load packages
library(tidyr)
# set fixed variables
INPUT_PATH <- "/rawdata"
OUTPUT_FILE <- "/final_data/datafile.csv"
# IMPORT RAW DATA FROM CSVs -------------
8.3 Loading/importing rectangular data
8.3.1 Loading built-in datasets
We start with the simplest case of loading/importing data. The basic R installation provides some example datasets to try out R's statistics functions. In the introduction to visualization techniques with R and the statistics examples in the chapters to come, we will rely on some of these datasets for simplicity. Note that the usage of these simple datasets shipped with basic R is very helpful when practicing/learning R on your own. Many R packages use these datasets over and over again in their documentation and examples. Moreover, extensive documentation and tutorials online also use these datasets (see, for example, the ggplot2 documentation). And they are very useful when searching for help on Stack Overflow in the context of data analysis/manipulation with R, as you should provide a code example based on some data that everybody can easily load and access.
In order to load such datasets, simply use the data() function:
data(swiss)
In this case, we load a dataset called swiss. After loading it, the data is stored in a variable of the same name as the dataset (here swiss). We can inspect it and have a look at the first few rows:
# inspect the structure
str(swiss)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
# look at the first few rows
head(swiss)
## Fertility Agriculture Examination Education
## Courtelary 80.2 17.0 15 12
## Delemont 83.1 45.1 6 9
## Franches-Mnt 92.5 39.7 5 5
## Moutier 85.8 36.5 12 7
## Neuveville 76.9 43.5 17 15
## Porrentruy 76.1 35.3 9 7
## Catholic Infant.Mortality
## Courtelary 9.96 22.2
## Delemont 84.84 22.2
## Franches-Mnt 93.40 20.2
## Moutier 33.77 20.3
## Neuveville 5.16 20.6
## Porrentruy 90.57 26.6
To get a list of all the built-in datasets, type data() into the console and hit enter. To get more information about a given dataset, use the help function (e.g., ?swiss).
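As a side note, data() can also be called programmatically: it returns an object whose results component is a character matrix listing the available datasets (a minimal sketch using only base R; the variable names are chosen for illustration):

```r
# list the datasets shipped with the 'datasets' package
d <- data(package = "datasets")

# 'd$results' is a character matrix; the 'Item' column holds the dataset names
datasets_available <- d$results[, "Item"]
"swiss" %in% datasets_available  # the swiss dataset is among them
```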
8.3.2 Importing rectangular data from text files
In most cases of applying R for data analysis, students and researchers rely on importing data from files stored on the hard disk. Typically, such datasets are stored in a text file format such as 'Comma Separated Values' (CSV). In economics, one also frequently encounters data stored in specific formats of commercial statistics/data analysis packages such as SPSS or STATA. Moreover, when collecting data on your own, you might rely on a spreadsheet tool like Microsoft Excel. Data from all these formats can easily be imported into R (in some cases, additional packages have to be loaded, though). What happens 'under the hood' is thereby essentially the same for all of these formats: somebody has implemented the respective parser as an R function that accepts a character string with the path or URL to the data source as input.
8.3.2.1 Comma Separated Values (CSV)
Recall how in this format the data values of one observation are stored in one row of a text file, while commas separate the variables/columns. For example, the following code block shows how the first two rows of the swiss dataset would look when stored in a CSV:
"District","Fertility","Agriculture","Examination","Education","Catholic","Infant.Mortality"
"Courtelary",80.2,17,15,12,9.96,22.2
The function read.csv() imports such files from disk into R (in the form of a data.frame). In this example, the swiss dataset is stored locally on our disk in the folder data:

swiss_imported <- read.csv("data/swiss.csv")
Alternatively, we could use the newer read_csv() function (from the readr package), which would return a tibble.
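To try this out without the data folder used above, here is a self-contained sketch that writes the two CSV rows shown earlier (shortened to three columns) to a temporary file and imports them with read.csv(); the file path is generated with tempfile() and thus hypothetical:

```r
# write a two-row CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c('"District","Fertility","Agriculture"',
             '"Courtelary",80.2,17',
             '"Delemont",83.1,45.1'),
           con = csv_path)

# import it as a data.frame
swiss_mini <- read.csv(csv_path)
str(swiss_mini)  # 'data.frame': 2 obs. of 3 variables
```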
8.3.2.2 Spreadsheets/Excel
To read Excel spreadsheets, we need to install an additional R package called readxl.
# install the package
install.packages("readxl")
Then we load this additional package ('library') and use the package's read_excel() function to import data from an Excel sheet. In the example below, the same data as above is stored in an Excel sheet called swiss.xlsx, again in a folder called data.
# load the package
library(readxl)
# import data from a spreadsheet
swiss_imported <- read_excel("data/swiss.xlsx")
## New names:
## • `` -> `...1`
8.3.2.3 Data from other data analysis software
The R packages foreign and haven contain functions to import data from formats used in other statistics/data analysis software, such as SPSS and STATA.
In the following example, we use haven's read_spss() function to import a version of the swiss dataset stored in SPSS' .sav format (again stored in the folder called data).
# install the package (if not yet installed):
# install.packages("haven")
# load the package
library(haven)
# read the data
swiss_imported <- read_spss("data/swiss.sav")
8.4 Import and parse with readr
The readr package is automatically installed and loaded with the installation/loading of tidyverse. It provides a set of functions to read different types of rectangular data formats and is usually more robust and faster than similar functions in the basic R distribution. Each of these functions expects either a character string with a path pointing to a file or a character string directly containing the data.
8.4.1 Basic usage of readr functions
For example, we can parse the first lines of the swiss dataset directly like this:
library(readr)
read_csv('"District","Fertility","Agriculture","Examination","Education","Catholic","Infant.Mortality"
"Courtelary",80.2,17,15,12,9.96,22.2')
## Rows: 1 Columns: 7
## ── Column specification ─────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): District
## dbl (6): Fertility, Agriculture, Examination, Education, Cath...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 1 × 7
## District Fertility Agricult…¹ Exami…² Educa…³ Catho…⁴ Infan…⁵
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Courtelary 80.2 17 15 12 9.96 22.2
## # … with abbreviated variable names ¹Agriculture, ²Examination,
## # ³Education, ⁴Catholic, ⁵Infant.Mortality
or read the entire swiss dataset by pointing to the file:

swiss <- read_csv("data/swiss.csv")
## Rows: 47 Columns: 7
## ── Column specification ─────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): District
## dbl (6): Fertility, Agriculture, Examination, Education, Cath...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In either case, the result is a tibble:
swiss
## # A tibble: 47 × 7
## District Fertility Agric…¹ Exami…² Educa…³ Catho…⁴ Infan…⁵
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Courtelary 80.2 17 15 12 9.96 22.2
## 2 Delemont 83.1 45.1 6 9 84.8 22.2
## 3 Franches-Mnt 92.5 39.7 5 5 93.4 20.2
## 4 Moutier 85.8 36.5 12 7 33.8 20.3
## 5 Neuveville 76.9 43.5 17 15 5.16 20.6
## 6 Porrentruy 76.1 35.3 9 7 90.6 26.6
## 7 Broye 83.8 70.2 16 7 92.8 23.6
## 8 Glane 92.4 67.8 14 8 97.2 24.9
## 9 Gruyere 82.4 53.3 12 7 97.7 21
## 10 Sarine 82.9 45.2 16 13 91.4 24.4
## # … with 37 more rows, and abbreviated variable names
## # ¹Agriculture, ²Examination, ³Education, ⁴Catholic,
## # ⁵Infant.Mortality
The other readr functions have practically the same syntax and behavior. They are used for fixed-width files or CSV-type files with delimiters other than commas.
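As an illustration of this point, readr's read_delim() handles, for instance, semicolon-delimited data (common in European CSV exports). The sketch below passes the data as a string directly, as described above; the values are made up for illustration:

```r
library(readr)  # assumes the readr package is installed

# semicolon-delimited data passed directly as a string
d <- read_delim('District;Fertility
Courtelary;80.2
Delemont;83.1', delim = ";", show_col_types = FALSE)
d
```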
8.4.2 Parsing and data types
From inspecting the swiss tibble shown above, we recognize that read_csv() not only correctly recognizes observations and columns (parses the CSV correctly) but also automatically guesses the data type of the values in each column. The first column is parsed as type character, the remaining columns as type double. That is, read_csv() also parses each column vector of the dataset with the aim of recognizing which data type it is. For example, the data value "12:00" could be interpreted simply as a character string. Alternatively, it could also be interpreted as a time format.
If "12:00" is an element of the vector c("12:00", "midnight", "noon"), it must be interpreted as a character string. If, however, it is an element of the vector c("12:00", "14:30", "20:01"), we probably want R to import it as a time format. Now, how can readr handle the two cases? In simple terms, the package first guesses for each column vector which type is most appropriate. Then, it uses a couple of lower-level parsing functions (one written for each possible data type) in order to parse each column according to the respective guessed type. We can demonstrate this for the two example vectors above.
read_csv('A,B
12:00, 12:00
14:30, midnight
20:01, noon')
## Rows: 3 Columns: 2
## ── Column specification ─────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): B
## time (1): A
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 3 × 2
## A B
## <time> <chr>
## 1 12:00 12:00
## 2 14:30 midnight
## 3 20:01 noon
Under the hood, read_csv() uses the guess_parser() function to determine which type the two vectors likely contain:
guess_parser(c("12:00", "midnight", "noon"))
## [1] "character"
guess_parser(c("12:00", "14:30", "20:01"))
## [1] "time"
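If the guess is not what we want, the column types can be set explicitly via read_csv()'s col_types argument. In the sketch below (reusing the made-up example data from above), we force both columns to be read as character, overriding the time guess for column A:

```r
library(readr)  # assumes the readr package is installed

# override the type guessing: read both columns as character
d <- read_csv('A,B
12:00,12:00
14:30,midnight
20:01,noon',
              col_types = cols(A = col_character(), B = col_character()))
d$A  # now a character vector, not a time format
```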
8.5 Importing web data formats
8.5.1 XML in R
There are several XML parsers already implemented in R packages specifically written for working with XML data. Thus, we do not have to understand the XML syntax in every detail to work with this data format in R. The already familiar package xml2 (automatically loaded when loading rvest) provides the read_xml() function, which we can use to read the example XML document.
# load packages
library(xml2)
# parse XML, represent XML document as R object
xml_doc <- read_xml("data/customers.xml")
xml_doc
## {xml_document}
## <customers>
## [1] <person>\n <name>John Doe</name>\n <orders>\n <produc ...
## [2] <person>\n <name>Peter Pan</name>\n <orders>\n <produ ...
The same package also has various functions to access, extract, and manipulate data from a parsed XML document. In the following code example, we have a look at the most useful functions for our purposes (see the package’s vignette for more details).
# navigate through the XML document (recall the tree-like nested structure similar to HTML)
# navigate downwards
# 'customers' is the root node; the person nodes are its children
persons <- xml_children(xml_doc)
# navigate sidewards
xml_siblings(persons)
## {xml_nodeset (2)}
## [1] <person>\n <name>Peter Pan</name>\n <orders>\n <produ ...
## [2] <person>\n <name>John Doe</name>\n <orders>\n <produc ...
# navigate upwards
xml_parents(persons)
## {xml_nodeset (1)}
## [1] <customers>\n <person>\n <name>John Doe</name>\n <o ...
# find data via XPath
customer_names <- xml_find_all(xml_doc, xpath = ".//name")
# extract the data as text
xml_text(customer_names)
## [1] "John Doe" "Peter Pan"
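Since the customers.xml file is not shipped with this text, the same technique can be tried out on an XML string passed directly to read_xml() (a self-contained sketch reproducing the two customer names from the example above):

```r
library(xml2)  # assumes the xml2 package is installed

# parse XML given directly as a string
doc <- read_xml(
  "<customers>
     <person><name>John Doe</name></person>
     <person><name>Peter Pan</name></person>
   </customers>")

# extract all names via XPath, as above
xml_text(xml_find_all(doc, ".//name"))
```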
8.5.2 JSON in R
Again, we can rely on an R package (jsonlite) providing high-level functions to read, manipulate, and extract data when working with JSON documents in R. An important difference from working with XML and HTML documents is that XPath is not compatible with JSON. However, as jsonlite represents parsed JSON as R objects of class list and/or data.frame, we can work with the parsed document as with any other R object of the same class. The following example illustrates this point.
# load packages
library(jsonlite)
# parse the JSON document shown in the example above
json_doc <- fromJSON("data/person.json")

# look at the structure of the document
str(json_doc)

# navigate the nested lists, extract data
# extract the address part
json_doc$address
# extract the gender (type)
json_doc$gender$type
## List of 6
## $ firstName : chr "John"
## $ lastName : chr "Smith"
## $ age : int 25
## $ address :List of 4
## ..$ streetAddress: chr "21 2nd Street"
## ..$ city : chr "New York"
## ..$ state : chr "NY"
## ..$ postalCode : chr "10021"
## $ phoneNumber:'data.frame': 2 obs. of 2 variables:
## ..$ type : chr [1:2] "home" "fax"
## ..$ number: chr [1:2] "212 555-1234" "646 555-4567"
## $ gender :List of 1
## ..$ type: chr "male"
## $streetAddress
## [1] "21 2nd Street"
##
## $city
## [1] "New York"
##
## $state
## [1] "NY"
##
## $postalCode
## [1] "10021"
## [1] "male"
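Analogously to read_xml(), fromJSON() also accepts a string containing the JSON data directly, which makes it easy to try out the extraction steps without the person.json file (a minimal sketch with a shortened, made-up version of the document above):

```r
library(jsonlite)  # assumes the jsonlite package is installed

# parse a JSON string directly
json_mini <- fromJSON('{"firstName": "John",
                        "address": {"city": "New York", "state": "NY"}}')

# navigate the nested list, as above
json_mini$address$city
```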
8.5.3 Tutorial (advanced): Importing data from an HTML table (on a website)
In the chapter on high-dimensional data, we discussed the Hypertext Markup Language (HTML) as code to define the structure/content of a website and HTML-documents as semi-structured data sources. The following tutorial revisits the basic steps in importing data from an HTML table into R.
The aim of the tutorial is to generate a CSV file containing data on ‘divided government’ in US politics. We use the following Wikipedia page as a data source: https://en.wikipedia.org/wiki/Divided_government_in_the_United_States. The page contains a table indicating the president’s party, and the majority party in the US House and the US Senate per Congress (2-year periods). The first few rows of the cleaned data are supposed to look like this:
## year president senate house
## 1: 1861 Lincoln R R
## 2: 1862 Lincoln R R
## 3: 1863 Lincoln R R
## 4: 1864 Lincoln R R
## 5: 1865 A. Johnson R R
## 6: 1866 A. Johnson R R
In a first step, we initiate fixed variables for paths and load additional R packages needed to handle data stored in HTML documents.
# SETUP ------------------
# load packages
library(rvest)
library(data.table)
# fixed vars
SOURCE_PATH <- "https://en.wikipedia.org/wiki/Divided_government_in_the_United_States"
OUTPUT_PATH <- "data/divided_gov.csv"
Now we write the part of the script that fetches the data from the Web. This part consists of three steps. First, we fetch the entire website (HTML document) from Wikipedia with read_html(). Second, we extract the tables contained in the HTML document (via html_table()) and keep the one with the data we want (the second table). Finally, the last line of the code chunk below removes the last row of the resulting data frame (you can see on the website that this row does not contain data).
# FETCH/FORMAT DATA -------------------
# fetch from web
doc <- read_html(SOURCE_PATH)
tab <- html_table(doc, fill = TRUE)[[2]]
tab <- tab[-nrow(tab), ] # remove last row (not containing data)
Now we clean the data to get a dataset more suitable for data analysis. Note that the original table contains one row per congress (2-year periods). However, as the sample above shows, we aim for a panel at the year level. The following code iterates through the rows of the data frame and generates two rows per congress (one for each year of the period).
# generate year-level data frame
# prepare loop
all_years <- list() # the container
n <- length(tab$Year) # number of cases to iterate through
length(all_years) <- n

# generate year-level observations, row by row
for (i in 1:n){
  # select row
  row <- tab[i,]
  y <- row$Year
  # split the congress period into begin and end year
  begin <- as.numeric(unlist(strsplit(x = y, split = "[\\–\\-]", perl = TRUE))[1])
  end <- as.numeric(unlist(strsplit(x = y, split = "[\\–\\-]", perl = TRUE))[2])
  # generate one observation per year of the period
  tabnew <- data.frame(year = begin:(end - 1), president = row$President,
                       senate = row$Senate, house = row$House)
  # store in container
  all_years[[i]] <- tabnew
}
# stack all rows together
allyears <- bind_rows(all_years) # bind_rows() is provided by the dplyr package
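To see what the strsplit()-based splitting inside the loop does, consider a single Year value in isolation (base R only; the example value mimics one cell of the table):

```r
# split a congress period like "1861-1863" into begin and end year
y <- "1861-1863"
parts <- as.numeric(unlist(strsplit(y, split = "[\\–\\-]", perl = TRUE)))

# one observation per year of the period
# (the end year is dropped, as it starts the next congress)
parts[1]:(parts[2] - 1)
```

The character class [\–\-] matches both the en dash used on the Wikipedia page and a plain hyphen.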
In a last step, we inspect the collected data and write it to a CSV file.
# WRITE TO DISK --------------------------------------------------
# inspect
head(allyears)
## year president senate house
## 1 1861 Lincoln R R
## 2 1862 Lincoln R R
## 3 1863 Lincoln R R
## 4 1864 Lincoln R R
## 5 1865 A. Johnson R R
## 6 1866 A. Johnson R R
# write to CSV
write_csv(allyears, file=OUTPUT_PATH)
This is a summary of Chapter 8 in Wickham and Grolemund (2017).
See ?strsplit, ?unlist, and this introduction to regular expressions for the background of how this is done in the code example here.