One of the most interesting recent developments in journalism is data journalism: using data to tell stories. Whether it is a large dataset, a chart, or an infographic, data can add life and credibility to your journalistic pieces. Journalism is an extremely data-driven industry. You could be researching school food quality or animal rights protection, or trying to uncover the truth about corruption among tax inspectors. Regardless of the topic, every journalist runs into the same obstacle: trouble getting access to information.

How do you get the data? This is the question every aspiring journalist asks at the start of a project. There are essentially two options:

  • Spend many hours browsing the internet and copy-pasting the data by hand, then another chunk of valuable time making sense of what you collected and saving it in the format you need
  • Get the data in the format you need with a web scraper (a minimal sketch follows this list)
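For readers curious about the second option, here is a minimal sketch of a scraper, assuming Python with the requests and beautifulsoup4 libraries installed. The URL and the table layout are hypothetical stand-ins for whatever public source you are actually researching.

```python
# A minimal scraper: fetch a page and pull rows out of an HTML table.
# Requires: pip install requests beautifulsoup4
# The URL and the "table tr" selector are placeholders; adapt them
# to the page you are researching.
import requests
from bs4 import BeautifulSoup

url = "https://example.gov/school-food-inspections"  # hypothetical source
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
rows = []
for tr in soup.select("table tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    if cells:
        rows.append(cells)

for row in rows[:5]:  # preview the first few rows
    print(row)
```

A dozen lines of code can replace hours of manual copy-pasting, and the same script can be rerun whenever the source page is updated.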

It is easy to see why fewer and fewer people choose the first method, turning instead to data harvesting services or to online tools they can use themselves. Which brings us to the first proven benefit.

1. Saving valuable time

People are in a constant struggle to save time, and when there is an opportunity to do something ten times more efficiently, you take it. Nowhere is this more relevant than in journalism. The industry is hectic, and a story's relevance can change in a heartbeat: miss the sweet spot and your piece will be neither interesting nor timely.

Understandably, web data harvesting has become a great tool for reporters who know how to code: more and more public institutions publish their data on their websites, and scraping makes getting hold of it easier and faster. With bots (another name for web scrapers) it is possible to gather large volumes of data for stories. And if you use a web data scraping service, the unstructured data is converted into a structured, well-organized, ready-to-use document, saving you the most valuable asset in the journalism industry: time!
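To illustrate that "unstructured in, structured out" step, here is a sketch using Python's standard csv module. It assumes `rows` is a list of equal-length lists, such as the output of the scraper sketched earlier; the column names and sample values are illustrative only.

```python
# Save scraped rows as a structured, ready-to-use CSV file
# that opens directly in a spreadsheet.
import csv

header = ["school", "inspection_date", "rating"]  # hypothetical columns
rows = [
    ["Lincoln Elementary", "2024-03-01", "A"],
    ["Riverside High", "2024-03-07", "B"],
]

with open("inspections.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)
```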

2. Effectively gathering sensitive information

Areas of journalism like investigative reporting require heaps of sensitive data to get to the bottom of each case and to back claims with relevant research. Simple research often falls short, because this kind of information is hard to come by and frequently cannot be downloaded or copied. Web scraping comes to the rescue here like a genie from the bottle: it can extract information even from sources that do not allow copying or downloading. And the more serious the matter, the more data you may need to scrape, for example criminal records or the results of official investigations.

This is a very sensitive area, and beginning journalists often make mistakes and end up breaching serious confidentiality laws. At the same time, it is exciting and rewarding to untangle some of the most convoluted stories with the power of your own reasoning and the immense amount of data that web harvesting puts at your disposal. As mentioned before, data journalism, one of the most exciting paths in the field, is perfect proof of how effective web scraping can be.

However, no matter how effective and time-saving web data harvesting is, it is still going through growing pains when it comes to legal matters. Because scraping appropriates pre-existing content from across the web, journalists who hope to use scrapers in their work face all kinds of ethical and legal quandaries.

In previous blog posts, we talked about the ethical rules associated with web scraping in general. Journalism is no exception. Quite the opposite: ethical concerns are especially relevant here, because publishers spend a lot of time, money, and effort producing the data that amateur journalists later scrape, and as a result some of them regard web scraping as plain hacking.
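As a practical nod to those concerns, here is a sketch of one simple courtesy: checking a site's robots.txt before scraping it, using Python's standard urllib.robotparser. The domain is a hypothetical placeholder, and passing this check is a matter of etiquette, not legal advice.

```python
# Check whether a site's robots.txt permits fetching a given page.
# A courtesy check, not a substitute for legal advice.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.gov/robots.txt")  # hypothetical site
rp.read()

page = "https://example.gov/school-food-inspections"
if rp.can_fetch("*", page):
    print("robots.txt allows scraping this page")
else:
    print("robots.txt disallows this page; reconsider or ask permission")
```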

An additional challenge for journalists is scraping copyrighted material. When your investigation involves sensitive data, you should talk to a lawyer and understand your fair use rights in depth. Fair use is based on the principle that the public is entitled to use portions of copyrighted material freely for purposes of commentary and criticism. Many amateur journalists take this literally and go to extremes, publishing personal data that fair use does not cover. It is also worth mentioning that free use should not be confused with fair use: the latter is a legal exception to the exclusive rights an owner holds over his or her copyrighted work.

To put it simply, web scraping may be one of the best and fastest ways to aggregate content from the internet and use it in your articles, but it comes with a caveat: it is also one of the hardest tools to judge from a legal standpoint. And since web scraping is difficult to prevent, the relevant authorities in many countries struggle to draw clear rules about what is legal and ethical and what is not.

In conclusion, as convenient and time-saving as it is to scrape huge amounts of data (even sensitive data) from websites and use the ready-made information in your stories, it comes with a share of responsibility and legal issues that cannot be neglected. One thing is clear: if you understand your rights and gather information responsibly, web scraping can be one of your best friends on the journey from amateur to professional journalism.

What do you think are the benefits of web scraping in journalism? Comment below and let us know!