Appendix B: R Basics
This appendix provides an overview of various key R properties, including data types and data structures.
Data types and memory/storage
Data loaded into RAM can be interpreted differently by R depending on the data type. Some operators or functions in R only accept data of a specific type as arguments. For example, we can store the numeric values 1.5 and 3 in the variables a and b, respectively.
a <- 1.5
b <- 3
a + b
## [1] 4.5
R interprets this data as type double (class ‘numeric’):
typeof(a)
## [1] "double"
class(a)
## [1] "numeric"
object.size(a)
## 56 bytes
If, however, we define a and b as follows, R will interpret the values stored in a and b as text (character).
a <- "1.5"
b <- "3"
a + b
## Error in a + b: non-numeric argument to binary operator
typeof(a)
## [1] "character"
class(a)
## [1] "character"
object.size(a)
## 112 bytes
Note that the symbols 1.5 take up more or less memory depending on the data type they are stored in. This is directly linked to how information is represented in binary code, which in turn determines how much memory is needed to store these symbols in an object as well as what we can do with them.
Example: Data types and information storage
Given the fact that computers only understand 0s and 1s, different approaches are taken to map these digital values to other symbols or images (text, decimal numbers, pictures, etc.) that we humans can more easily make sense of. Regarding text and numbers, these mappings involve character encodings (in which combinations of 0s and 1s represent a character in a specific alphabet) and data types.
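As a minimal sketch of such a mapping, the base R functions charToRaw() and rawToBits() can be used to look at the byte and the bit pattern behind a single character. The character "R" and the variable names are chosen purely for illustration.

```r
# the letter "R" as stored in memory: one byte (printed in hexadecimal)
r_byte <- charToRaw("R")
r_byte

# the same byte as a sequence of 0s and 1s
# (rawToBits() returns the least significant bit first, hence rev())
r_bits <- paste(rev(as.integer(rawToBits(r_byte))), collapse = "")
r_bits

# the corresponding decimal code and the way back to the character
r_code <- utf8ToInt("R")
r_code
intToUtf8(r_code)
```

The encoding defines which bit pattern stands for which character; the data type then tells R how such patterns are to be interpreted and used.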
Let’s illustrate the main concepts with a simple numerical example. When we see the decimal number 139 written somewhere, we know that it means ‘one-hundred-and-thirty-nine’. The fact that our computer is able to print 139 on the screen means that our computer can somehow map a sequence of 0s and 1s to the symbols 1, 3, and 9. Depending on what we want to do with the data value 139 on our computer, there are different ways in which the computer can represent this value internally. Among other options, we could load it into RAM as a string (‘text’/‘character’), as an integer (‘natural number’), or as a double (numeric, floating point number). All of them can be printed on screen, but only the latter two can be used for arithmetic computations. This concept can easily be illustrated in R.
We initiate a new variable with the value 139. By using this syntax, R by default initiates the variable as an object of type double. We then can use this variable in arithmetic operations.
my_number <- 139
# check the class
typeof(my_number)
## [1] "double"
# arithmetic
my_number * 2
## [1] 278
When we change the data type to ‘character’ (string), such operations are not possible.
# change and check type/class
my_number_string <- as.character(my_number)
typeof(my_number_string)
## [1] "character"
# try to multiply
my_number_string * 2
## Error in my_number_string * 2: non-numeric argument to binary operator
If we change the variable to type integer, we can still use math operators.
# change and check type/class
my_number_int <- as.integer(my_number)
typeof(my_number_int)
## [1] "integer"
# arithmetic
my_number_int * 2
## [1] 278
Having all variables in the correct type is important for data analytics regardless of the sample size. However, because different data types must be represented differently internally, different types may take up more or less memory, which affects performance when dealing with massive amounts of data.
We can illustrate this point with object.size():
object.size("139")
## 112 bytes
object.size(139)
## 56 bytes
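The difference becomes more visible with longer vectors. The following sketch stores the same value one million times, once as integer and once as double, and compares the memory used. The vector length and the variable names are arbitrary choices for illustration.

```r
n <- 1e6
values_int <- rep(139L, n)  # type integer
values_dbl <- rep(139, n)   # type double

typeof(values_int)
typeof(values_dbl)

# compare the memory footprint of the two vectors
object.size(values_int)
object.size(values_dbl)

# ratio of the two sizes
as.numeric(object.size(values_dbl)) / as.numeric(object.size(values_int))
```

On common platforms, a double occupies 8 bytes and an integer 4 bytes per value, so the ratio should be close to 2.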
Data structures
So far, we have only looked at individual values. A single dataset, however, can contain gigabytes of data, including both text and numeric values. R has several classes of objects that provide different data structures. Both the data types and the data structures used to store data affect how much memory is required to hold a dataset in RAM.
Vectors vs. Factors in R
Vectors are collections of values of the same type. For example, a vector can contain either all numeric values or all character values, but not a mix of both.
For example, we can initiate a character vector containing information on the hometowns of persons participating in a survey.
hometown <- c("St.Gallen", "Basel", "St.Gallen")
hometown
## [1] "St.Gallen" "Basel" "St.Gallen"
object.size(hometown)
## 200 bytes
Unlike in the data types example above, storing these values as type numeric to save memory is unlikely to be practical.
R would be unable to convert these strings into floating point numbers. Alternatively, we could consider a correspondence table in which each unique town name in the dataset is assigned a numeric (id) code. We would save memory this way, but it would require more effort to work with the data. Fortunately, the data structure ‘factor’ in base R already implements this idea in a user-friendly manner.
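To make the correspondence-table idea concrete, here is a minimal hand-made sketch using only base R. The object names towns and hometown_id are hypothetical and only serve the illustration.

```r
hometown <- c("St.Gallen", "Basel", "St.Gallen")

# correspondence table: one entry per unique town name
towns <- sort(unique(hometown))
towns

# replace each town name by its position (id) in the table
hometown_id <- match(hometown, towns)
hometown_id

# translate the ids back to town names when needed
towns[hometown_id]
```

Every operation on the town names now requires an extra lookup step, which is the additional effort mentioned above.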
Factors are sets of categories. Thus, the values are drawn from a fixed set of possible values.
Considering the same example as above, we can store the same information in an object of class factor.
hometown_f <- factor(c("St.Gallen", "Basel", "St.Gallen"))
hometown_f
## [1] St.Gallen Basel St.Gallen
## Levels: Basel St.Gallen
object.size(hometown_f)
## 584 bytes
At first glance, the fact that hometown_f consumes more memory than its character vector sibling appears strange.
But we’ve seen this kind of ‘paradox’ before. Once again, the more sophisticated approach has an overhead (here not in terms of computing time but in terms of structure encoded in an object). hometown_f has more structure (i.e., a number-to-‘factor level’/category label mapping).
This additional structure is also data that must be saved somewhere. This disadvantage, as in previous examples of overhead costs, diminishes with larger datasets:
# create a large character vector
hometown_large <- rep(hometown, times = 1000)
# and the same content as factor
hometown_large_f <- factor(hometown_large)
# compare size
object.size(hometown_large)
## 24168 bytes
object.size(hometown_large_f)
## 12568 bytes
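The number-to-label mapping inside a factor can be inspected directly with base R functions. The short sketch below recreates the small factor from the example so that it runs on its own.

```r
hometown_f <- factor(c("St.Gallen", "Basel", "St.Gallen"))

# internally, a factor is an integer vector ...
typeof(hometown_f)
as.integer(hometown_f)

# ... plus an attribute holding the category labels
levels(hometown_f)

# the labels are recovered by indexing the levels with the integer codes
levels(hometown_f)[as.integer(hometown_f)]
```

Each observation is stored as a small integer code, while each label is stored only once in the levels. This is a plausible reason why the factor ends up smaller than the character vector once there are many observations and few distinct categories.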
Matrices/Arrays
Matrices are two-dimensional collections of values of the same type; arrays are higher-dimensional collections of values of the same type.
For example, we can initiate a three-row/two-column numeric matrix as follows.
my_matrix <- matrix(c(1,2,3,4,5,6), nrow = 3)
my_matrix
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
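And a one-dimensional numeric array with three elements as follows. Note that dim = 3 sets the length of a single dimension (not the number of dimensions), so only the first three values are used.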
my_array <- array(c(1,2,3,4,5,6), dim = 3)
my_array
## [1] 1 2 3
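An array with several dimensions is obtained by passing one length per dimension to the dim argument. The following sketch builds an array with three dimensions; the object name my_array_3d and the chosen dimension lengths are hypothetical.

```r
# 24 values arranged in 2 rows, 3 columns, and 4 'layers'
my_array_3d <- array(1:24, dim = c(2, 3, 4))
dim(my_array_3d)

# select the element in row 1, column 2, layer 3
my_array_3d[1, 2, 3]

# select the entire second layer (a 2-by-3 matrix)
my_array_3d[, , 2]
```

Indexing works as for matrices, with one index per dimension.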
Data frames, tibbles, and data tables
Remember that in R, data frames are the most common way to represent a (table-like) dataset. Each column can contain a vector of a specific data type (or a factor), but all columns must be the same length. In the context of data analysis, each row of a data frame contains an observation, and each column contains a characteristic of that observation.
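As a brief reminder, a data frame with two observations and three characteristics could be defined as sketched below. The object name df is hypothetical, and stringsAsFactors = FALSE is set explicitly so that the text columns stay of type character in older R versions as well.

```r
# each column is a vector of one type; all columns have the same length
df <- data.frame(person = c("Alice", "Ben"),
                 age = c(50, 30),
                 gender = c("f", "m"),
                 stringsAsFactors = FALSE)
df

# data type of each column
sapply(df, typeof)

# number of observations (rows) and characteristics (columns)
dim(df)
```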
The original implementation of data frames in R made it difficult to work with large datasets.[80] Several newer R implementations of the data-frame concept were introduced with the aim of speeding up data processing. One is known as tibble, and it is implemented and used in the tidyverse packages. The other is known as data.table, and it is implemented in the data.table package. Most of the shortcomings of the original ‘data.frame’ implementation, however, have been addressed in subsequent R versions, making traditional data.frames, tibbles, and data.tables more similarly suitable for working with large datasets (for in-memory processing).
Here is how we define a data.table in R:
# load package
library(data.table)
# initiate a data.table
dt <- data.table(person = c("Alice", "Ben"),
                 age = c(50, 30),
                 gender = c("f", "m"))
dt
##    person age gender
## 1:  Alice  50      f
## 2:    Ben  30      m
Lists
Similar to data frames and data tables, lists can contain different types of data in each element. For example, a list could contain several other lists, data frames, and vectors with differing numbers of elements.
This flexibility can easily be demonstrated by combining some of the data structures created in the examples above:
my_list <- list(my_array, my_matrix, dt)
my_list
## [[1]]
## [1] 1 2 3
##
## [[2]]
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
##
## [[3]]
##    person age gender
## 1:  Alice  50      f
## 2:    Ben  30      m
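Individual list elements can be accessed by position or, if the list is named, by name. The following self-contained sketch uses a small hypothetical list (example_list) built only with base R to show the difference between the [[ ]] and [ ] operators.

```r
# a list with elements of different types and lengths
example_list <- list(numbers = c(1, 2, 3),
                     letters_matrix = matrix(c("a", "b", "c", "d"), nrow = 2),
                     nested = list(flag = TRUE))

# [[ ]] (or $) extracts the element itself
example_list[["numbers"]]
example_list$nested$flag

# [ ] returns a (shorter) list containing the element
class(example_list["numbers"])

# overview of the structure of all elements
str(example_list)
```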
[80] This was not an issue in the early days of R because datasets that were rather large by today’s standards (in the gigabytes) could not have been handled properly by normal computers anyway (due to a lack of RAM).