Appendix B: R Basics

This appendix provides an overview of various key R properties, including data types and data structures.

Data types and memory/storage

Data loaded into RAM can be interpreted differently by R depending on the data type. Some operators or functions in R only accept data of a specific type as arguments. For example, we can store the numeric values 1.5 and 3 in the variables a and b, respectively.

a <- 1.5
b <- 3
a + b

## [1] 4.5

R interprets this data as type double (class ‘numeric’):

typeof(a)

## [1] "double"

class(a)

## [1] "numeric"

object.size(a)

## 56 bytes

If, however, we define a and b as follows, R will interpret the values stored in a and b as text (character).

a <- "1.5"
b <- "3"
a + b

## Error in a + b: non-numeric argument to binary operator

typeof(a)

## [1] "character"

class(a)

## [1] "character"

object.size(a)

## 112 bytes

Note that the symbols 1.5 take up more or less memory depending on the data type in which they are stored. This is a direct consequence of how the information is represented in binary code, which determines both how much memory an object needs and what we can do with it.
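If numeric values arrive as text, they can be converted explicitly before any arithmetic is done. The following is a minimal sketch using base R's as.numeric(); the variable names a_num and b_num are introduced here purely for illustration.

```r
# values stored as text (character)
a <- "1.5"
b <- "3"

# convert to type double before doing arithmetic
a_num <- as.numeric(a)
b_num <- as.numeric(b)
a_num + b_num

# text that does not represent a number cannot be converted:
# as.numeric() returns NA and issues a warning (suppressed here)
suppressWarnings(as.numeric("one point five"))
```

The sum evaluates to 4.5, as in the first example, while the second call returns NA.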

Example: Data types and information storage

Given the fact that computers only understand 0s and 1s, different approaches are taken to map these digital values to other symbols or images (text, decimal numbers, pictures, etc.) that we humans can more easily make sense of. Regarding text and numbers, these mappings involve character encodings (in which combinations of 0s and 1s represent a character in a specific alphabet) and data types.
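As a small illustration of such a mapping, the following sketch uses the base R functions charToRaw(), utf8ToInt(), and intToBits() to look at how the single character "A" is encoded. The object names (a_raw, a_code, a_bits) are hypothetical and chosen only for this example.

```r
# the character "A" as stored in memory: a single byte (hexadecimal 41)
a_raw <- charToRaw("A")
a_raw

# the same byte read as an integer code (65)
a_code <- utf8ToInt("A")
a_code

# the eight bits of that byte, most significant bit first
# (intToBits() returns the least significant bit first, hence rev())
a_bits <- paste(rev(as.integer(intToBits(a_code))[1:8]), collapse = "")
a_bits
```

Here the combination of 0s and 1s 01000001 stands for the letter A.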

Let’s illustrate the main concepts with a simple numerical example. When we see the decimal number 139 written somewhere, we know that it means ‘one-hundred-and-thirty-nine’. The fact that our computer is able to print 139 on the screen means that it can somehow map a sequence of 0s and 1s to the symbols 1, 3, and 9. Depending on what we want to do with the data value 139, the computer can represent this value internally in different ways. Among other options, we could load it into RAM as a string (‘text’/‘character’), as an integer (‘whole number’), or as a double (numeric, floating point number). All of them can be printed on screen, but only the latter two can be used for arithmetic computations. This concept can easily be illustrated in R.

We initiate a new variable with the value 139. By using this syntax, R by default initiates the variable as an object of type double. We then can use this variable in arithmetic operations.

my_number <- 139
# check the type
typeof(my_number)

## [1] "double"

# arithmetic
my_number*2

## [1] 278

When we change the data type to ‘character’ (string) such operations are not possible.

# change and check type/class
my_number_string <- as.character(my_number)
typeof(my_number_string)

## [1] "character"

# try to multiply
my_number_string*2

## Error in my_number_string * 2: non-numeric argument to binary operator

If we change the variable to type integer, we can still use math operators.

# change and check type/class
my_number_int <- as.integer(my_number)
typeof(my_number_int)

## [1] "integer"

# arithmetic
my_number_int*2

## [1] 278

Having all variables in the correct type is important for data analytics regardless of the sample size. In addition, because different data types are represented differently internally, they may take up more or less memory, which affects performance when dealing with massive amounts of data.

We can illustrate this point with object.size():

object.size("139")

## 112 bytes

object.size(139)

## 56 bytes
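For a single value, most of the reported size is fixed per-object overhead. The difference between types becomes clearer for long vectors. The following sketch, with a hypothetical vector length n of one million, compares the integer and double representations of the same number; R stores an integer in 4 bytes and a double in 8 bytes per element.

```r
# number of elements (an arbitrary choice for illustration)
n <- 1e6

# the same value stored one million times as integer and as double
x_int <- rep(139L, n)
x_dbl <- rep(139, n)

# compare the memory footprint of the two vectors
size_int <- object.size(x_int)
size_dbl <- object.size(x_dbl)
format(size_int, units = "MB")
format(size_dbl, units = "MB")

# ratio of the two sizes (close to 2)
as.numeric(size_dbl) / as.numeric(size_int)
```

The double vector needs roughly twice the memory of the integer vector, although both contain the same numbers.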

Data structures

So far, we have only looked at individual data values. A single dataset, however, can contain gigabytes of data, including both text and numeric values. R has several classes of objects that provide different data structures. The data types and data structures used to store data can both affect how much memory is required to hold a dataset in RAM.

Vectors vs. Factors in R

Vectors are collections of values of the same type. For example, a vector can contain either all numeric values or all character values, but not a mix of both.

For example, we can initiate a character vector containing information on the hometowns of persons participating in a survey.

hometown <- c("St.Gallen", "Basel", "St.Gallen")
hometown

## [1] "St.Gallen" "Basel"     "St.Gallen"

object.size(hometown)

## 200 bytes

Unlike in the data types example above, storing these values as type numeric to save memory is unlikely to be practical. R would be unable to convert these strings into floating point numbers. Alternatively, we could consider a correspondence table in which each unique town name in the dataset is assigned a numeric (id) code. We would save memory this way, but it would require more effort to work with the data. Fortunately, the data structure ‘factor’ in basic R already implements this idea in a user-friendly manner.
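To make the correspondence-table idea concrete, a manual version might look like the sketch below. The objects towns and town_id are hypothetical names introduced for this illustration; this is not how R implements factors internally, only the same idea spelled out by hand.

```r
hometown <- c("St.Gallen", "Basel", "St.Gallen")

# correspondence table: each unique town name gets a position (its id)
towns <- sort(unique(hometown))
towns

# store only the numeric id of each observation
town_id <- match(hometown, towns)
town_id

# recover the original values from the ids when needed
towns[town_id]
```

Every operation on the town names now requires a look-up in towns, which is the extra effort mentioned above.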

Factors are sets of categories. Thus, the values are drawn from a fixed set of possible values.

Considering the same example as above, we can store the same information in an object of class factor.

hometown_f <- factor(c("St.Gallen", "Basel", "St.Gallen"))
hometown_f

## [1] St.Gallen Basel     St.Gallen
## Levels: Basel St.Gallen

object.size(hometown_f)

## 584 bytes

At first glance, the fact that hometown_f consumes more memory than its character vector sibling appears strange. But we’ve seen this kind of ‘paradox’ before. Once again, the more sophisticated approach has an overhead (here not in terms of computing time but in terms of structure encoded in an object). hometown_f has more structure (i.e., a number-to-‘factor level’/category label mapping). This additional structure is also data that must be saved somewhere. This disadvantage, as in previous examples of overhead costs, diminishes with larger datasets:

# create a large character vector
hometown_large <- rep(hometown, times = 1000)
# and the same content as factor
hometown_large_f <- factor(hometown_large)
# compare size
object.size(hometown_large)

## 24168 bytes

object.size(hometown_large_f)

## 12568 bytes
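The number-to-label mapping stored in a factor can be inspected directly with base R functions. The following short sketch recreates hometown_f and looks at its two components, the integer codes and the levels.

```r
hometown_f <- factor(c("St.Gallen", "Basel", "St.Gallen"))

# internally, a factor is stored as an integer vector
typeof(hometown_f)

# the category labels (levels) are stored only once
levels(hometown_f)

# each observation is an integer code pointing to one of the levels
as.integer(hometown_f)

# combining codes and levels recovers the original character values
levels(hometown_f)[as.integer(hometown_f)]
```

Because each label is stored only once and each observation is a 4-byte integer code, the savings grow with the number of repeated values.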

Matrices/Arrays

Matrices are two-dimensional collections of values of the same type; arrays are higher-dimensional collections of values of the same type.

For example, we can initiate a three-row/two-column numeric matrix as follows.

my_matrix <- matrix(c(1,2,3,4,5,6), nrow = 3)
my_matrix

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

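And a numeric array as follows. Note that dim = 3 creates an array with a single dimension of length 3, so only the first three values are used; a three-dimensional array requires a dim vector with three entries, such as dim = c(1, 2, 3).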

my_array <- array(c(1,2,3,4,5,6), dim = 3)
my_array

## [1] 1 2 3
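For comparison, here is a minimal sketch of an array with three dimensions, where dim is given as a vector with one entry per dimension. The name my_array_3d and the chosen dimensions (2, 3, and 4) are assumptions made for this illustration.

```r
# 24 values arranged in 2 rows, 3 columns, and 4 'layers'
my_array_3d <- array(1:24, dim = c(2, 3, 4))

# check the dimensions
dim(my_array_3d)

# select a single element: row 2, column 3, layer 4
my_array_3d[2, 3, 4]

# select the entire first layer (a 2-by-3 matrix)
my_array_3d[, , 1]
```

As with matrices, the values fill the array column by column, so the last element selected above is 24.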

Data frames, tibbles, and data tables

Remember that in R, data frames are the most common way to represent a (table-like) dataset. Each column can contain a vector of a specific data type (or a factor), but all columns must be the same length. In the context of data analysis, each row of a data frame contains an observation, and each column contains a characteristic of that observation.
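A minimal sketch of a traditional data frame, using hypothetical example values, shows both properties: columns can be of different types, and all columns must have the same length.

```r
# initiate a data frame with three columns of equal length
df <- data.frame(person = c("Alice", "Ben"),
                 age = c(50, 30),
                 gender = c("f", "m"),
                 stringsAsFactors = FALSE)

# compact overview of the structure
str(df)

# each column has its own class
sapply(df, class)

# columns of incompatible lengths are rejected with an error
res <- try(data.frame(a = 1:2, b = 1:3), silent = TRUE)
inherits(res, "try-error")
```

Setting stringsAsFactors = FALSE explicitly keeps the text columns as character vectors regardless of the R version used.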

The original implementation of data frames in R made it difficult to work with large datasets.⁸⁰ Several newer implementations of the data-frame concept were therefore introduced with the aim of speeding up data processing. One is known as tibble, and it is implemented and used in the tidyverse packages. The other is known as data table, and it is implemented in the data.table package. Most of the shortcomings of the original ‘data.frame’ implementation, however, have been addressed in subsequent R versions, making traditional data.frames, tibbles, and data.tables similarly suitable for working with large datasets (for in-memory processing).

Here is how we define a data.table in R:

# load package
library(data.table)
# initiate a data.table
dt <- data.table(person = c("Alice", "Ben"),
                 age = c(50, 30),
                 gender = c("f", "m"))
dt

##    person age gender
## 1:  Alice  50      f
## 2:    Ben  30      m
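A data.table is still a data frame as far as R's class system is concerned, which is why it can be passed to functions that expect a data.frame. The sketch below checks this and shows one row filter in data.table syntax. It assumes the data.table package is installed and is skipped otherwise.

```r
# run only if the data.table package is available
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- data.table(person = c("Alice", "Ben"),
                   age = c(50, 30),
                   gender = c("f", "m"))
  # a data.table inherits from data.frame
  print(class(dt))
  print(is.data.frame(dt))
  # select the rows for which age is above 40
  dt_over_40 <- dt[age > 40]
  print(dt_over_40)
}
```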

Lists

Similar to data frames and data tables, lists can contain different types of data in each element. For example, a list could contain several other lists, data frames, and vectors with differing numbers of elements.

This flexibility can easily be demonstrated by combining some of the data structures created in the examples above:

my_list <- list(my_array, my_matrix, dt)
my_list

## [[1]]
## [1] 1 2 3
## 
## [[2]]
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
## 
## [[3]]
##    person age gender
## 1:  Alice  50      f
## 2:    Ben  30      m
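Individual list elements can be accessed by position or by name. The following sketch builds a small named list from base R objects only (the element names numbers, m, and nested are hypothetical) and contrasts the [[ ]] and [ ] operators.

```r
# a list with elements of different types and lengths
example_list <- list(numbers = c(1, 2, 3),
                     m = matrix(1:6, nrow = 3),
                     nested = list("a", TRUE))

# double brackets (or $) return the element itself
example_list[["m"]]
example_list$numbers[2]

# single brackets return a (shorter) list
example_list["m"]

# each element keeps its own type
sapply(example_list, typeof)
```

The distinction matters in practice: example_list[["m"]] can be used directly as a matrix, whereas example_list["m"] is still a list of length one.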

R-tools to investigate structures and types

package	function	purpose
`utils`	`str()`	Compactly display the structure of an arbitrary R object.
`base`	`class()`	Prints the class(es) of an R object.
`base`	`typeof()`	Determines the (R-internal) type or storage mode of an object.
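A brief sketch of the three functions applied to the same object, here a small data frame with hypothetical content, shows how their answers differ.

```r
survey <- data.frame(person = c("Alice", "Ben"),
                     age = c(50, 30),
                     stringsAsFactors = FALSE)

# compact overview: dimensions, column names, column types, first values
str(survey)

# the class determines how R functions treat the object
class(survey)

# the internal type: a data frame is stored as a list of columns
typeof(survey)
```

Here class() reports "data.frame" while typeof() reports "list", which illustrates the difference between an object's class and its internal storage type.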

⁸⁰ This was not an issue in the early days of R because datasets that are considered rather large by today’s standards (in the gigabytes) could not have been handled properly by ordinary computers anyway (due to a lack of RAM).