References

Arthur, Charles. 2014. “Facebook Emotion Study Breached Ethical Guidelines, Researchers Say.” Guardian, June. https://www.theguardian.com/technology/2014/jun/30/facebook-emotion-study-breached-ethical-guidelines-researchers-say.
Giles, Jim. 2010. “Data Sifted from Facebook Wiped After Legal Threats.” New Scientist, March. https://www.newscientist.com/article/dn18721-data-sifted-from-facebook-wiped-after-legal-threats/.
Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. 2014. “Experimental Evidence of Massive-Scale Emotional Contagion Through Social Networks.” PNAS 111 (24). http://www.pnas.org/content/111/24/8788.full.pdf.
Liu, Bing. 2011. Web Data Mining. New York, NY: Springer.
Markham, Annette, and Elizabeth Buchanan. 2012. “Ethical Decision-Making and Internet Research: Recommendations from the AoIR Ethics Working Committee (Version 2.0).” Report. Association of Internet Researchers. http://aoir.org/reports/ethics2.pdf.
O’Brien, David. 2014. “In the Age of the Web, What Does ‘Public’ Mean?” In Internet Monitor 2014: Reflections on the Digital World: Platforms, Policy, Privacy, and Public Discourse, edited by Urs Gasser, Jonathan Zittrain, Robert Faris, and Rebekah Heacock Jones. Berkman Center Research Publication, No. 2014-17. Berkman Center for Internet & Society at Harvard University. https://dash.harvard.edu/bitstream/handle/1/13632937/IM2014_ReflectionsontheDigitalWorld%5B1%5D.pdf.
Snell, James, and Nicola Menaldo. 2016. “Web Scraping in an Era of Big Data 2.0.” Bloomberg Law. https://www.bna.com/web-scraping-era-n57982073780/.

  1. Another reason to focus on US cases is that much social science research based on data automatically extracted from the Web relies on data provided or hosted by US firms: Facebook, Twitter, Google, etc. Note that the basic logic of when web scraping is very likely not ‘OK’ is probably rather similar in the US and other parts of the world (at least among developed democracies). Moreover, ethical standards regarding social science research based on automatically extracted data from web sources are likely very similar.↩︎

  2. It should be noted that these guidelines should not be construed as professional legal advice. Rather, they are a set of ethical considerations for Internet researchers designed to make them aware of potential problems.↩︎

  3. Note that these dimensions should not be read as the boundaries of the legal theories that web scraping might come into contact with. Rather, the categorization aims to clarify along which lines the interests of the web miner and the owner of the website’s content are likely to collide and cause conflict.↩︎

  4. Whether the distinction between a human user visiting a webpage through a browser and a human user writing a script that visits a webpage is meaningful is still an open debate. What is clear, however, is that many website owners consider this distinction quite relevant, and it has proven relevant from a legal point of view in at least some cases.↩︎

  5. In the US it is under certain circumstances considered a federal crime according to the Computer Fraud and Abuse Act.↩︎

  6. See comments in Wikipedia’s robots.txt: https://en.wikipedia.org/robots.txt.↩︎

  7. As a consequence, any user relying on this IP address can no longer access any content on that website, neither via a crawler nor via a normal web browser. Note that even if the website’s owner does not take legal action against the crawler, this can be quite problematic for the user and for other people who rely on the same IP address.↩︎

  8. Note that while the logic of potential conflicts due to business interests can be applied to situations in many jurisdictions, the actual cases and the specifics of the legal disputes discussed here only refer to the US context.↩︎

  9. More US-specific categories of cases brought to court have involved violations of the Computer Fraud and Abuse Act (CFAA) or analogous state statutes, trespass to chattels, and hot news misappropriation (Snell and Menaldo 2016).↩︎

  10. Generally, attribution does not help. However, when content is licensed under a Creative Commons Attribution license (not uncommon for online content), attribution might be sufficient. In any case, when content is reproduced or republished as part of presenting the results of a research project, it is crucial to consider potential copyright issues.↩︎