Preface

Background and goals of this book

In the past ten years, “Big Data” has frequently been referred to as the new “most valuable” resource in highly developed economies, spurring the creation of new goods and services across a range of sectors. Extracting knowledge from large datasets is increasingly seen as a strategic asset for firms, governments, and NGOs. In a similar vein, the increasing size of datasets in empirical economic research (both in the number of observations and the number of variables) offers new opportunities and poses new challenges for economists and business leaders. To meet these challenges, universities have started adapting their curricula in traditional fields such as economics, computer science, and statistics, and have begun offering new degrees in data analytics, data science, and data engineering.

However, in practice (both in academia and industry), there is frequently a gap between, on the one hand, the knowledge of how to formulate relevant hypotheses and devise an appropriate empirical strategy (the data analytics side) and, on the other hand, the skills to collect and handle large amounts of data to test these hypotheses (the data engineering side). While large, specialized organizations like Google and Amazon can afford to hire entire teams of specialists on each side, as well as the crucially important liaisons between such teams, many small businesses and academic research teams simply cannot. This is where this book comes into play.

The primary goal of this book is to help practitioners of data analytics and data science apply their skills in a Big Data setting. By bridging the knowledge gap between the data engineering and analytics sides, this book discusses tools and techniques that allow data analytics and data science practitioners in academia and industry to efficiently handle and analyze large amounts of data in their daily work. In addition, the book aims to give decision makers in data teams, as well as liaisons between analytics teams and engineers, a practical overview of helpful approaches to Big Data projects. For the data analytics and data science practitioner in academia or industry, this book can thus serve as an introduction and handbook to the practical issues of Big Data Analytics. Moreover, many parts of this book originated from lecture materials and interactions with students in my Big Data Analytics course for graduate students in economics at the University of St. Gallen and the University of Lucerne. As such, this book, while not written in a classical textbook format, can also serve as a textbook in graduate courses on Big Data Analytics in various degree programs.

A moving target

Big Data Analytics is a moving target due to the ever-increasing amounts of data being generated and the rapid developments in the software tools and hardware devices used to analyze large datasets. For example, with the recent advent of the Internet of Things (IoT) and the ever-growing number of connected devices, more data is being generated than ever before, and this data is constantly changing and evolving. At the same time, the software tools used to analyze large datasets are continually being updated and improved, making them more powerful and efficient. As a result, practical Big Data Analytics is a constantly evolving field that requires continuous monitoring and updating in order to remain competitive. You might thus be concerned that a couple of months after reading this book, the techniques learned here might already be outdated.

So how can we deal with this situation? Some might suggest that the key is to stay informed about the latest developments in the field, such as new algorithms, languages, and tools as they emerge. Others might suggest following the latest trends in the industry, such as the use of large language models (LLMs), as these technologies are becoming increasingly important in the field. In this book, I take a complementary approach. Inspired by the transferability of basic economics, I approach Big Data Analytics by focusing on transferable knowledge and skills. This approach rests on two pillars:

  1. First, the emphasis is on investing in a reasonable selection of software tools and solutions that can assist in making the most of the data being collected and analyzed, both now and in the future. This is reflected in the selection of R (R Core Team 2021) and SQL as the primary languages in this book. While R is clearly one of the most widely used languages in applied econometrics, business analytics, and many domains of data science at the time of writing this book (and this may change in the future), I am confident that learning R (and the related R packages) in the Big Data context will be a highly transferable skill in the long run. I believe this for two primary reasons: a) recent years have shown that more specialized lower-level software for Big Data Analytics increasingly includes easy-to-use high-level interfaces to R (the packages arrow and sparklyr discussed in this book are good examples of this development); b) even if the R packages (or R itself) discussed in this book are outdated in a few years, the way R is used as a high-level scripting language (connected to lower-level software and cloud tools) will likely remain in a similar form for many years to come. That is, this book does not simply suggest which current R package you should use to solve a given problem with a large dataset. Instead, it gives you an idea of what the underlying problem is all about, why a specific R package or underlying specialized software like Spark might be useful (and how it conceptually works), and how the corresponding package and problem are related to the available computing resources (see the short sketch following this list for a first impression). After reading this book, you will be well equipped to address the same computational problems discussed in this book with a language other than R (such as Julia or Python) as your primary analytics tool.

  2. Second, when dealing with large datasets, the emphasis is on a basic understanding of the hardware components (computing resources) that underlie the various Big Data approaches. Why a task takes so long to compute is not always (only) a matter of which software tool you are using. If you understand why a task is difficult to perform from a hardware standpoint, you will be able to transfer the techniques introduced in this book’s R context to other computing environments relatively easily.
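
To give a first impression of what the first pillar means in practice, consider the following minimal sketch. It shows how an R package such as arrow lets you run familiar dplyr verbs on a dataset that is too large to fit into memory, with the heavy lifting done by a lower-level engine; the file path and column names are hypothetical placeholders, and essentially the same verbs would work on a Spark cluster via sparklyr.

    # Minimal sketch: dplyr verbs as a high-level interface to a
    # lower-level engine (here, Arrow). The file path and column
    # names are hypothetical placeholders.
    library(arrow)
    library(dplyr)

    # open_dataset() only scans the files' metadata; the data is not
    # yet loaded into R's memory.
    flights <- open_dataset("data/flights_parquet")

    # The pipeline is executed by the Arrow engine; only the (small)
    # aggregate result is pulled into R via collect().
    flights |>
      filter(year == 2023) |>
      group_by(carrier) |>
      summarize(mean_delay = mean(dep_delay, na.rm = TRUE)) |>
      collect()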

The structure of the book, discussed in the next subsection, is aimed at strengthening these two pillars.

Content and organization of the book

Overall, this book introduces the reader to the fundamental concepts of Big Data Analytics for gaining insights from large datasets. The book’s emphasis is on the practical application of econometrics and business analytics to large datasets, as well as on all of the steps involved before actually analyzing the data (data storage, data import, data preparation). It combines theoretical and conceptual material with practical applications of the concepts in R and SQL. As a result, the reader will gain the fundamental knowledge required to analyze large datasets both locally and in the cloud.

The practical problems associated with analyzing Big Data, as well as the corresponding approaches to solving them, are generally presented in the context of applied econometrics and business analytics settings throughout this book. This means that I tend to concentrate on observational data, which is common in economics and business/management research. In terms of statistics/analytics techniques, this context calls for a special emphasis on regression analysis, as this is the most commonly used statistical tool in applied econometrics. The context also determines the scope of the examples and tutorials. Typically, the goal of a data science project in applied econometrics and business analytics is not to deploy a machine learning model as part of an operational app or web application (as is often the case for many working in data science). Instead, the goal of such projects is to gain insights into a specific economic/business/management question in order to facilitate data-driven decisions or policy recommendations. As a result, the output of such projects (as well as of the tutorials/examples in this book) is a set of statistics summarizing the quantitative insights in a way that could be displayed in a seminar/business presentation or an academic paper/business report. Finally, the context influences how the code examples and tutorials are structured: the code examples are typically part of an interactive session or a short analytics script (and not part of the development of larger applications).

The book is organized into four main parts. The first part introduces the reader to the topic of Big Data Analytics from the perspective of a practitioner in empirical economics and business research. It covers the differences between Big P and Big N problems and shows avenues for practically addressing each.

The second part focuses on the tools and platforms to work with Big Data. This part begins by introducing a set of software tools that will be used extensively throughout the book: (advanced) R and SQL. It then discusses the conceptual foundations of modern computing environments and how different hardware components matter in practical local Big Data Analytics, as well as how virtual servers in the cloud help to scale up and scale out analyses when local hardware lacks sufficient computing resources.

The third part of this book covers the first components of a data pipeline: data collection and storage, data import/ingestion, data cleaning/transformation, data aggregation, and exploratory data visualization (with a particular focus on Geographic Information Systems, GIS). The chapters in this part discuss fundamental concepts such as the split-apply-combine approach and demonstrate how to apply them in practice when working with large datasets in R. Many tutorials and code examples demonstrate how a specific task can be implemented locally as well as in the cloud using comparatively simple tools.
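
To illustrate the kind of concept covered there, the following minimal sketch shows the split-apply-combine approach with the data.table package: the data is split into groups, an aggregation is applied to each group, and the results are combined into one table. The package choice and the column names are merely illustrative assumptions.

    # Minimal sketch of split-apply-combine (illustrative data and
    # column names)
    library(data.table)

    dt <- data.table(
      region = c("A", "A", "B", "B"),
      sales  = c(10, 20, 30, 40)
    )

    # Split by region, apply mean() to each group, and combine the
    # per-group results into one table
    dt[, .(mean_sales = mean(sales)), by = region]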

Finally, the fourth part of the book covers a wide range of topics in modern applied econometrics in the context of Big Data, from simple regression estimation and machine learning with Graphics Processing Units (GPUs) to running machine learning pipelines and large-scale text analyses on a Spark cluster.

Prerequisites and requirements

This book focuses heavily on R programming. The reader should be familiar with R and fundamental programming concepts such as loops, control statements, and functions (Appendix B provides additional material on specific R topics that are particularly relevant in this book). Furthermore, the book assumes some knowledge of undergraduate and basic graduate statistics/econometrics. R for Data Science by Wickham and Grolemund (2016) (this is what our undergraduate students work through before taking my Big Data Analytics class), Mostly Harmless Econometrics by Angrist and Pischke (2008), and Introduction to Econometrics by Stock and Watson (2003) are all good books to prepare with. Regarding hardware and software requirements, you will generally get along just fine with an up-to-date R and RStudio installation. However, given the nature of this book’s topics, some code examples and tutorials might require you to install additional software on your computer. In most of these cases, this additional software works on Linux, macOS, and Windows machines. In some cases, though, I will point out that certain dependencies might not work on a Windows machine. Generally, this book has been written on a Pop!_OS/Ubuntu Linux (version 22.04) machine with R version 4.2.0 (or later) and RStudio 2022.07.2 (or later). All examples (except for the GPU-based computing) have also been successfully tested on a MacBook running macOS 12.4 and the same R and RStudio versions as above.
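
If you want to verify your setup before diving in, a quick sanity check like the following can help; note that the package selection in the loop is just an example, not the book’s full list of dependencies.

    # Check the R version (should report 4.2.0 or later) and install
    # two packages used later in the book if they are missing. The
    # package selection here is only an example.
    R.version.string

    for (pkg in c("arrow", "sparklyr")) {
      if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg)
      }
    }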

Supplementary Materials: Code Examples, Datasets, and Documentation

A repository of supplementary online resources for this book is available at the book’s dedicated GitHub repository: https://github.com/umatter/bigdata. The repository’s README file maintains an updated list of these resources, including links to the R scripts containing the code examples presented in this book, the sources of the datasets used, and additional documentation to help with the installation of the packages and software featured in the book.

For educators considering this book as a primary or supplementary text for a course, there is an additional GitHub repository at https://github.com/umatter/bigdata-lecture. This repository houses a collection of my teaching materials, including slide presentations and extra code examples. All of these materials are freely available under a CC BY-SA 2.0 license. If you choose to use these resources, please familiarize yourself with the accompanying usage terms at https://creativecommons.org/licenses/by-sa/2.0/.

Thanks

Many thanks go to the students in my past Big Data Analytics classes. Their interest and engagement with the topic, as well as their many great analytics projects, were an important source of motivation to start this book project. I’d also like to thank Lara Spieker, Statistics and Data Science Editor at Chapman & Hall, who was very supportive of this project right from the start, for her encouragement and advice throughout the writing process. I am also grateful to Chris Cartwright, the external editor, for his thorough assistance during the book’s drafting stage. Finally, I would like to thank Mara, Marc, and Irene for their love, patience, and company throughout this journey. This book would not have been possible without their encouragement and support in challenging times.

References

Angrist, Joshua D., and Joern-Steffen Pischke. 2008. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Stock, James H., and Mark W. Watson. 2003. Introduction to Econometrics. Pearson Education.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science. O’Reilly Media, Inc.