A Google search for “data scraping ethics” returns nearly 270,000 results. The topic is hotly debated, with vocal arguments both for and against scraping practices.
But the real issue lies not in data scraping as a practice, but in what people choose to do with the data they collect, and in how they collect it. For example, scraping the web to gather prices for a retailer’s comparison tool is one thing; scraping so aggressively that you overload a server and take a website offline in a denial-of-service attack is quite another.
The murky legality of certain applications of data scraping has left many people who use scraping software confused and hungry for guidance. The one common refrain you hear from everyone is that it depends on what you are using the scraped data for.
But not so fast: there are times when you can be in violation of the law even if you are scraping for honest, benevolent purposes. Today we are going to go over how to play by the rules with data scraping, and why it is important to do so ethically.
Some site admins will not care if you scrape their site, but others take it extremely seriously, such as American Airlines and Southwest Airlines, who both filed suit against the firm FareChase in the early 2000s (and against another firm doing the same thing). FareChase scraped the airlines’ sites to find the best flight deals for its customers; the airlines argued this violated the United States Computer Fraud and Abuse Act, claiming that FareChase was in effect stealing customers away while using the companies’ own copyrighted materials. While the cases were eventually settled, FareChase and similar firms shuttered their doors soon after.
In the above example, FareChase’s scraping was challenged as both unethical and unlawful. However, data scraping is still new technological territory, so not every unethical action is yet covered by statute, and laws vary from country to country. Keep in mind that scraping data in and of itself is not unethical; it all depends on what you plan to use the data for. For clarity: scraping a directly competing company’s customer list in order to get leads is unethical, while scraping a customer directory for leads in a non-competing business is not.
Tip: If you read through a site’s Terms and Conditions (T&C) or Terms of Use (ToU) and the language is a little vague, or if you have any doubts whatsoever, there is a quick fix for the ambiguity: ask the admin directly. This will clear up any fuzziness. If a site has data that is valuable for your purposes and your scraping will not harm their business, the admin should have no problem with it, and if they do, you have saved yourself a major headache.
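Alongside the terms pages, most sites also publish a robots.txt file stating which paths they allow automated clients to fetch, and Python’s standard library can parse it for you. Here is a minimal sketch; the rules and URLs below are hypothetical examples, and in practice you would point `set_url()` at the live `https://<site>/robots.txt` and call `read()` instead of parsing a string:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration. For a real site,
# use parser.set_url("https://example.com/robots.txt"); parser.read()
sample_rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# Check whether a generic crawler ("*") may fetch each path.
print(parser.can_fetch("*", "https://example.com/products"))    # True
print(parser.can_fetch("*", "https://example.com/private/x"))   # False
```

robots.txt is advisory rather than legally binding, but respecting it is a strong signal of good faith if your scraping is ever questioned.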
The Danger of a Denial of Service Charge
Even if web scraping is performed for the most benevolent of reasons, say, conducting academic research, it can easily get out of hand if left unchecked. A common example is sending a website so much traffic that the server cannot handle the activity and shuts down access for other users. This is called Denial of Service.
Denial of Service is a technique that malicious hackers employ to shut down websites they have an agenda against, for whatever reason. Whether you are a hacker or just a business owner conducting research, causing a Denial of Service outage on a site can result in legal action being taken against you.
One of the most famous cases of Denial of Service claims leveled against a company for excessive data scraping was QVC v. Resultly. Resultly, a Pinterest-esque shopping aggregator, was scraping QVC’s public website for real-time pricing updates. However, because it had not throttled its scraper, it sent hundreds of search requests at a rate that brought down the QVC site for two whole days. QVC followed with a lawsuit, since it lost sales for the duration that its site was inaccessible.
Why does this happen? Automated data scrapers “read” a website’s content far more quickly than a human being, who would scroll through pages and take notes on whatever data they were hunting for. Avoiding server overload is tricky, because few sites make it clear how much load their servers can handle.
Here are some tips and tricks to lower your chances of overloading a site’s server:
- Space out your scraping intervals to give servers room to breathe
- Set your scraper to operate on off-peak business hours for the site
- Assume that smaller companies use smaller servers, so don’t scrape them as aggressively as, say, a federal government site
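The first two tips above can be sketched in a few lines of Python. This is a minimal illustration rather than a production crawler; the interval and the off-peak window are assumptions you would tune per site:

```python
import time


class PoliteThrottle:
    """Enforces a minimum pause between successive requests."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s
        self._last_request = None

    def wait(self):
        """Sleep just long enough to honor the minimum interval, then
        record the current time as the start of the next request."""
        now = time.monotonic()
        if self._last_request is not None:
            remaining = self.min_interval_s - (now - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()


def is_off_peak(hour: int) -> bool:
    # Assumption: the site's quiet window is roughly 1am-6am local time.
    return 1 <= hour < 6


# Usage sketch: wait() before each fetch caps the request rate at
# one request every 2 seconds, no matter how fast the server responds.
throttle = PoliteThrottle(min_interval_s=2.0)
for url in ["https://example.com/page1", "https://example.com/page2"]:
    throttle.wait()
    # fetch(url) would go here
```

Capping your own request rate, regardless of how quickly the target responds, is the single simplest defense against accidentally mounting a Denial of Service.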
By now you should have a decent idea of the behaviors to avoid if you want to steer clear of legal trouble and unethical scraping practices. To state the obvious, in case you have any doubts: private information should never be scraped, as that makes you a hacker, which we of course do not advocate!
Because navigating data scraping while avoiding legal issues or unethical practices is a tricky business, we suggest using a professional service like DataHen, so you can get the data you need without headaches or hassles.
If you have had any personal experiences with scraping ethics or legality, feel free to share them in the comments section!