Chapter 2 The Internet as a Data Source

2.1 The Internet: Physical and Technological Layer

The Internet is fundamentally a network of small local physical computer networks (connected via copper cables, fiber optic cables, or radio waves). Figure 2.1 depicts a typical local network scheme found in a private home. By connecting a computer in your home to the Internet, your local network, which may include your laptop, phone, printer, modem, and router, becomes a part of this vast network of networks.

Figure 2.1: Illustration of a home network (WLAN).

Internet Service Providers (ISPs) serve as hubs, connecting many of the smaller local networks. To get Internet access at home, one usually subscribes to an ISP, which connects the home network to the rest of the Internet via cable TV, phone line, or optical fiber. Larger network infrastructure then connects different ISPs across country borders and even across oceans.

Routers are the central nodes in most parts of the Internet, connecting different devices and managing the data traffic between them. In order for routers to work properly, each computer/device in the network (and the Internet overall) needs a standardized address, the IP (Internet Protocol) address, which uniquely identifies it in the network. Usually, the local router in our home assigns IP addresses that are valid within the local network, while the router itself has an IP address that identifies it in the outside world (in other parts of the Internet).1 IP addresses are so far mostly based on four numbers (IPv4) with 8 bits each (values of 0 to 255 in decimal).2 A typical IP address looks something like this: 216.58.219.196. In order to request a document (i.e., a website) from a computer on the Internet, we would thus have to know this computer’s IP address. Fortunately, the Domain Name System (DNS) does the address ‘look-up’ for us by translating domain names (e.g., www.google.com) into IP addresses.
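
To see this distinction between local and public addresses on your own machine, you can compare the address assigned by your router with the address visible to the outside world. The following is only a minimal sketch for Linux/Mac OS; the external service used to echo the public address (ifconfig.me) is just one of several such services:

# show the address(es) this machine was assigned within the local network
ip addr show        # on Linux; on Mac OS use: ifconfig

# ask an external service which address the rest of the Internet sees
# (this is typically the router's public IP address, not your laptop's local one)
curl ifconfig.me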

We can demonstrate this DNS look-up in the (Mac OS) terminal by calling a program called nslookup (‘name server look-up’) to find the IP address of the server behind the domain www.google.com.3

nslookup www.google.com
## Server:      127.0.0.53
## Address: 127.0.0.53#53
## 
## Non-authoritative answer:
## Name:    www.google.com
## Address: 142.250.203.100
## Name:    www.google.com
## Address: 2a00:1450:400a:808::2004

The first two lines refer to the local DNS server that answered the request. The remaining lines give us the addresses of one of Google’s servers, here both an IPv4 address (142.250.203.100) and an IPv6 address (2a00:1450:400a:808::2004). When typing a URL into the address bar of a web browser, essentially the same happens ‘under the hood’.
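
Another widely available DNS query tool is dig; the following is a complementary sketch (not part of the original example), and the exact output depends on your local DNS resolver:

# query the DNS for www.google.com and print only the resulting IP address(es)
dig www.google.com +short

# reverse look-up: ask which host name is registered for a given IP address
dig -x 142.250.203.100 +short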

As the Internet is a large network consisting of various small local networks, requesting data from a particular website means sending data packets from your local network via several routers to a machine in another physical local network (potentially far away). Again, we can make use of our computer to illustrate this point. With the application traceroute we record which nodes (usually routers, each with an IP address) a data packet passes through in order to reach the website/server behind a domain (in this example, the homepage of Princeton University).4

traceroute www.princeton.edu
## traceroute to www.princeton.edu (104.18.4.101), 30 hops max, 60 byte packets
##  1  fritz.box (192.168.178.1)  7.073 ms  7.031 ms  7.023 ms
##  2  790oer1.fiber7.init7.net (212.51.143.1)  12.163 ms  12.154 ms  12.146 ms
##  3  r1.790see.fiber7.init7.net (141.195.82.131)  12.187 ms  13.350 ms  13.341 ms
##  4  r1glb2.core.init7.net (141.195.82.128)  13.308 ms  13.301 ms  13.294 ms
##  5  r2zrh2.core.init7.net (5.180.135.183)  14.167 ms  14.160 ms  14.152 ms
##  6  r1zrh3.core.init7.net (5.180.135.166)  14.154 ms  3.739 ms  4.904 ms
##  7  r1zrh5.core.init7.net (5.180.134.39)  4.871 ms  5.373 ms  5.326 ms
##  8  194.42.48.14 (194.42.48.14)  4.761 ms  5.851 ms  15.364 ms
##  9  104.18.4.101 (104.18.4.101)  8.931 ms  8.922 ms  8.915 ms

As IP addresses can be mapped to geographical locations (with a certain degree of precision), we can actually trace the route of the data packets we send through the Internet on a map. See, e.g., https://stefansundin.github.io/traceroute-mapper/ for mapping traceroute terminal output (example in Figure 2.2) or http://www.dnstools.ch/visual-traceroute.html for a hosted visual traceroute tool.

Figure 2.2: Map illustrating the route the data packets took to reach princeton.edu. Source: https://stefansundin.github.io/traceroute-mapper/.
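
As a rough command-line illustration of such a mapping, one can query a public IP geolocation service. The example below is a sketch using the free ipinfo.io service (one of several such services, and an assumption here rather than part of the original example); note that the returned location refers to where the address is registered, not necessarily to where the physical server stands:

# look up the (approximate) geographic location registered for the final hop of the traceroute above
curl https://ipinfo.io/104.18.4.101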

Simply contacting a server on the Internet requires only very little data to be transmitted. Commonly, though, we transfer/download much larger amounts of data when using the Internet. If the data to be transferred between two devices on the Internet is larger than one packet, the devices, following the Transmission Control Protocol (TCP), split the data into pieces and ensure that all pieces arrive at the destination. Each packet (piece) is labeled with a sequence number, and the receiver checks whether all numbers have arrived in order to make sure the data is complete. If a part is missing, the receiver recognizes this thanks to TCP and asks the sender to resend it.
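
These numbered packets can be observed directly with a packet sniffer such as tcpdump. The command below is only a sketch: it requires administrator rights, tcpdump may first need to be installed, and the filter on port 443 (HTTPS) is merely an assumption about the kind of traffic you want to watch:

# capture the first five TCP packets on the HTTPS port and print their headers,
# including the sequence numbers TCP uses to reassemble the transferred data
sudo tcpdump -c 5 -nn 'tcp port 443'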

2.2 The World Wide Web’s Content: Human and Social Layer

So far, we have only examined the Internet as a physical/technological entity, a network of networks that enables computers to communicate and exchange data with one another. As shown, analyzing this physical layer can already be useful for social science research. However, the vast majority of web-based social science research focuses on the actual content of the Internet, or more specifically, the World Wide Web. While web data mining thrives on the automation of data extraction and data collection, both in the purely physical dimension of the Internet discussed above and in the actual content of the web, it is the social aspects of how content on the Internet is generated that make web data mining truly exciting for the social sciences. Many important data sources for economists, political scientists, sociologists, and others are and have been available without the need for web automation. Web data mining, however, allows us to ‘observe’ social and economic processes online and to systematically collect data on various aspects of everyday human life for which no standardized data set exists. The logic of how data is generated online, either directly or indirectly, by users pursuing their own goals (e.g., buying or selling something on eBay, writing a blog post, commenting on a newspaper article, following another Twitter user) thus has several advantages over traditional data sources in the social sciences. This is especially true for observational data, which is used extensively in empirical economic research.

2.3 Economic Data Generating Processes Online

One of the main reasons the Internet is an interesting data source for economic research is that the availability of information on economic interactions (e.g., between buyer and seller, politician and voter, author and publisher) is a core part of the business model of commercial websites (or of the raison d’être of non-commercial websites). The social/economic (not the technical or legal) aspects necessary to make a website work thus often demand a certain degree of transparency regarding the information involved in the interactions happening on that website. Edelman (2012, 190) illustrates this point in the context of consumer goods and online auctions: “Consumers and competitors push websites to post remarkable amounts of information online. For example, most retail booksellers would hesitate to share information about which items they sold. Yet eBay posts the full bid history for every item offered for sale, and Amazon updates its rankings of top-selling items every hour.”

The following list gives an overview of the various ways automated data extraction from the web has served as a basis for research projects in different sub-fields of economics (based on Edelman (2012)):

References

Ackermann, Klaus, Simon D. Angus, and Paul A. Raschky. 2017. “The Internet as Quantitative Social Science Platform: Insights from a Trillion Observations.” arXiv:1701.05632v1 [q-fin.EC]. arxiv.org. https://arxiv.org/pdf/1701.05632.pdf.
Antweiler, Werner, and Murray Z. Frank. 2004. “Is All That Talk Just Noise? The Information Content of Internet Stock Message Boards.” The Journal of Finance 59 (3): 1259–94. http://www.jstor.org/stable/3694736.
Bajari, Patrick, and Ali Hortaçsu. 2003. “The Winner’s Curse, Reserve Prices, and Endogenous Entry: Empirical Insights from eBay Auctions.” The RAND Journal of Economics 34 (2): 329–55. http://www.jstor.org/stable/1593721.
Cavallo, Alberto. 2016. “Scraped Data and Sticky Prices.” The Review of Economics and Statistics (advance online publication). https://doi.org/10.1162/REST_a_00652.
Chevalier, Judith, and Austan Goolsbee. 2003. “Measuring Prices and Price Competition Online: Amazon.com and BarnesandNoble.com.” Quantitative Marketing and Economics 1 (2): 203–22. https://doi.org/10.1023/A:1024634613982.
Edelman, Benjamin. 2012. “Using Internet Data for Economic Research.” Journal of Economic Perspectives 26 (2): 189–206. https://doi.org/10.1257/jep.26.2.189.
Ginsberg, Jeremy, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski, and Larry Brilliant. 2009. “Detecting Influenza Epidemics Using Search Engine Query Data.” Nature 457 (7232): 1012–14. https://doi.org/10.1038/nature07634.
Roberts, Hal, David Larochelle, Rob Faris, and John Palfrey. 2011. “Mapping Local Internet Control.” Berkman Center for Internet & Society at Harvard University. http://cyber.harvard.edu/netmaps/mlic_20110513.pdf.

  1. IP addresses are assigned following the Dynamic Host Configuration Protocol (DHCP).↩︎

  2. Due to the insufficient number of possible unique addresses in this system, it is gradually being extended to IPv6 (eight groups of numbers with 16 bits each).↩︎

  3. The same command is also available on Windows machines with identical or very similar usage in the Windows (DOS) command line (depending on the Windows version). See https://www.computerhope.com/nslookup.htm for details.↩︎

  4. Note that the code shown below runs in a Mac or Linux terminal. On Windows (DOS), use tracert instead (e.g., tracert www.princeton.edu).↩︎

  5. The researchers also provide a website with network diagrams and additional statistics based on their method for various countries.↩︎