Legality and Ethics of Web Scraping
Web scraping can be a powerful tool for journalists, allowing them to quickly gather large amounts of data. It can also be confusing and intimidating at times. In this post, we’ll look at some things to consider when deciding whether web scraping is a good solution for your data needs.
When can web scraping help me?
Web scraping involves using a computer program to automate data collection. Scraping is most useful when collecting data requires considerable repetition. This could involve downloading files for each county in the state, as demonstrated in this previous Carolina Data Desk post. It could also involve using patterns in text or in a webpage’s structure to extract the data you are interested in. In my previous story on college basketball, I used this approach to gather data from game play-by-plays. Web scraping can also be useful for saving regularly updated data on a schedule, so that you build up a longer time range than the site shows at any one moment.
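To make the two patterns above a little more concrete, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the URL template, the county names, and the play-by-play line format are hypothetical, and a real site will have its own layout that you would need to inspect first.

```python
import re

# Hypothetical URL template; a real site will use its own naming scheme.
URL_TEMPLATE = "https://example.com/data/{county}.csv"


def build_county_urls(counties):
    """Return one download URL per county by filling in the template."""
    return [
        URL_TEMPLATE.format(county=name.lower().replace(" ", "-"))
        for name in counties
    ]


# Hypothetical play-by-play line format: "MM:SS TEAM description".
PLAY_PATTERN = re.compile(r"^(\d{1,2}:\d{2})\s+(\w+)\s+(.+)$")


def parse_plays(text):
    """Extract the clock, team, and description from lines that fit the pattern."""
    plays = []
    for line in text.splitlines():
        match = PLAY_PATTERN.match(line.strip())
        if match:  # lines that do not fit the pattern are skipped
            clock, team, description = match.groups()
            plays.append(
                {"clock": clock, "team": team, "description": description}
            )
    return plays
```

The first function covers the repetition case (one file per county), and the second covers the pattern case (pulling structured fields out of text). Neither one touches the network, so you can try them on a saved copy of a page before pointing anything at a live site.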
Is this legal?
The answer to this question is less clear than the answer to the previous one. Many factors can influence whether web scraping is legal in a given case, and the answer is not always readily apparent.
The bottom line is this: if the data is a public record, scrape away. If not, make sure that scraping it will not break any laws. The main potential legal issues involved with web scraping for journalists are the following:
- Copyright Infringement: The central purpose of the Copyright Act is “to secure a fair return for an author’s creative labor and to stimulate artistic creativity for the general good.” This mainly covers reproducing the data publicly, not collecting it. Basically, this means that if you scrape data from a website, you cannot then reproduce it in exactly the same way on your own website. As long as the data is transformed in some way and is not a substitute for the data on the original website, the Copyright Act does not prohibit publishing web scraped data.
- Hot News Misappropriation: If the data you want to scrape is time-sensitive and was gathered at the effort and/or expense of another party, you should not swoop in, scrape the data, and steal the story. However, new insights generated from analyzing scraped data are not considered hot news misappropriation.
There are two additional legal considerations that can be invoked with regard to web scraping. These are less likely to be an issue for journalistic web scraping, but it’s still good to be aware of them. They are:
- Computer Fraud and Abuse Act: This act states that activity that “exceeds authorized access” is illegal. This is aimed at hacking that bypasses technological barriers, such as password protection. This is probably difficult to do by accident, so I wouldn’t be overly concerned about violating this by web scraping.
- Trespass to Chattels: For web scraping, this involves accessing a website at such a high volume that its server capacity is diminished. You don’t need to worry about this one unless you will be hitting a site many thousands of times in a short period.
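On the last point, one simple way to keep your request volume low is to pause between requests. The Python sketch below is one possible approach, not a standard recipe: the function name `fetch_politely`, its parameters, and the one-second default delay are all my own assumptions, and the right delay depends on the site.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch` is whatever function you use to download a page (hypothetical
    here). `sleep` can be swapped out, which makes the function easy to test.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request after the first
        results.append(fetch(url))
    return results
```

For example, passing `fetch=lambda url: urllib.request.urlopen(url).read()` (after `import urllib.request`) would download each page with a one-second gap between requests, which keeps even a long list of URLs far below the kind of volume described above.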
Please note that these are only summaries and general guidelines. For more detailed legal information about any of these considerations, this article is a great resource.