In our increasingly digital age, more and more business is conducted online. Often, a company’s website is the main channel through which consumers stay in contact with the company. This is especially true for news websites, as news is one of the areas that has been digitized the most in recent years. The decline in print sales and advertising has put significant pressure on news groups and companies to find new sources of income.
The initial spike in revenue from digital advertising has been followed by a year-after-year decline, connected with low mobile costs. On the one hand, this means greater volume and accessibility for customers; on the other hand, it means greater risk and growing complexity around content “stealing”.
In this blog post we will talk about the practice of scraping news websites: its purpose and intentions, the legal ways of obtaining content, and the importance of staying a “good web scraping citizen”.
Most people think that you need to learn a programming language to start scraping news websites for information, but that is not necessarily true; it is one of the most common misconceptions about data scraping. The first thing you will learn as a journalist is that you can have two competitive advantages over your colleagues: being able to work faster and managing to get more relevant information than others. In both cases, scraping is the quickest solution.
Of course, using a website’s API can be an alternative. In fact, if all you need is to interact with the system, APIs are great, because almost every major system you come across has a developed API or at least the intention to develop one. But if your main purpose is extracting data from the page itself, web scraping is usually the better solution.
It is true that most websites don’t have strategies that ban web scraping, as such measures may negatively affect the overall user experience. Even constantly amended “fair use” terms and conditions fail to keep up with web harvesters, who are always looking for new ways to use bots, crawlers or spider programs to harvest and mine headlines and text from news websites for various reasons, such as: creating an aggregated news feed; monitoring news sites to identify the latest articles; analysing data; extracting clean article text automatically; comparing and keeping track of sports matches, etc.
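To make the idea of mining headlines a little more concrete, here is a minimal sketch in Python that uses only the standard library. It assumes, purely for illustration, that the page wraps each headline in an `<h2>` tag; real news sites use different markup, and the names `HeadlineParser`, `headlines_about` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real site will use different markup.
SAMPLE_HTML = """
<h2><a href="/a">New health guidelines published</a></h2>
<h2>Local team wins the cup</h2>
<h2>Public Health budget under review</h2>
"""


class HeadlineParser(HTMLParser):
    """Collects the text found inside one kind of tag (assumed: h2)."""

    def __init__(self, tag="h2"):
        super().__init__()
        self.tag = tag
        self.headlines = []
        self._buffer = None  # list of text pieces while inside a headline

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._buffer is not None:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headlines.append(text)
            self._buffer = None


def headlines_about(html, keyword, tag="h2"):
    """Return headlines that mention the keyword, ignoring case."""
    parser = HeadlineParser(tag)
    parser.feed(html)
    parser.close()
    return [h for h in parser.headlines if keyword.lower() in h.lower()]


if __name__ == "__main__":
    for headline in headlines_about(SAMPLE_HTML, "health"):
        print(headline)
```

The same filter could feed an aggregated news feed or a monitoring job; only the tag and the keyword would need to change for a given site.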
Whatever your end goal is, if you’re going to be scraping any site regularly, keep in mind that it’s important to be a good web scraping citizen: used inappropriately, your script can overload the site and ruin the experience for the rest of its users.
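One common way to behave well is to check a site’s `robots.txt` before fetching anything and to pause between requests. The sketch below uses Python’s standard `urllib.robotparser` module; the rules are parsed from an inline sample string so the example runs without network access, and the bot name, URLs and the `load_rules` and `plan_visits` helpers are hypothetical.

```python
from urllib import robotparser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_visits(parser, user_agent, urls, default_delay=1.0):
    """Return (url, delay_in_seconds) pairs for URLs the rules allow.

    The delay is the site's Crawl-delay if it declares one, otherwise
    default_delay (an assumed fallback, not a standard value).
    """
    delay = parser.crawl_delay(user_agent) or default_delay
    return [(url, delay) for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    rules = load_rules(SAMPLE_ROBOTS_TXT)
    candidates = [
        "https://news.example.com/world/story.html",
        "https://news.example.com/private/draft.html",
    ]
    for url, delay in plan_visits(rules, "example-news-bot", candidates):
        print(f"fetch {url}, then wait {delay}s")
```

In a real script you would load the live file with `set_url()` and `read()` on the parser, and call `time.sleep(delay)` between requests. Note that `robots.txt` is only a convention; it does not replace reading the site’s terms and conditions.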
The main advantage of scraping news websites, and data in general, is that you can do it with virtually any website: as long as the content is online, it is possible to scrape it, from weather forecasts to government spending, even if the particular site does not have an API for raw data access. You want only news articles about “health”? No problem at all! You want blog posts in a certain language? From a specific country? You got it! It is a simple and cost-effective way of obtaining data from the web that will save you a lot of time and money if done “sustainably”, so you can focus on what to do with the obtained data.
Before you start, though, there are a few things to consider:
- The country of origin of the websites whose content you are targeting. You may be surprised to find out how many countries have strict local laws that forbid web harvesting.
- Read the terms and conditions of each website you will be targeting individually. Many of them state clearly “no bots” and “no content and/or news duplication”.
- The key point is always the purpose of scraping. Usually, scraping for educational purposes, or even maintaining a database of news for personal use, carries little to no risk for the website, and the owners might be OK with turning a blind eye and letting you scrape their content. But if you are planning to sell the content to others (especially competing sources) or are duplicating their content on your website, then they can and most likely will file a lawsuit.
- At the end of the day, you have to be sure that you are not, directly or indirectly, harming the news website’s business. News websites spend countless days and a lot of effort to produce a genuine news article. Duplicating such unique content hurts their business. It’s not so surprising that they don’t welcome the practice, is it?
Are there any aspects of news website scraping you are interested in that we didn’t cover? Make sure to comment below and let us know!