Web Scraping for Continuous Monitoring of eCommerce Websites

Some call it theft; others call it legitimate business-intelligence gathering. Either way, everyone is doing it.

In the age of Big Data, companies realize the value of data in the eCommerce business. Data points like pricing, product IDs, images, product specifications, and brand are extremely useful for a variety of purposes, but at the same time hard to get. Product data feeds from eCommerce sites are used to gain a competitive advantage over other suppliers. Scraping those feeds is one of the most reliable and easiest ways to monitor your competitors and the market, even though some companies take steps to prevent it.

The most common use cases of eCommerce product feeds are:

  • Price comparison (comparing data from other sites and displaying the best deals)
  • Affiliate sites (using product content to bring search-engine traffic to their site and earn commissions)
  • Strategy development (web scraping can play a crucial role here by supplying the most accurate and ready-to-use data)
  • Decision making (decisions are made by analyzing the data you have; choices such as when to give discounts, how much, and on which products can make a huge difference in generated revenue)

The most common items extracted from eCommerce websites are listed below (a sketch of a matching data record follows the list):

  • Product Name
  • Price
  • Product Features
  • Product Type
  • Manufacturer & Brand
  • Deals & Offers
  • Product Description
  • Company Description
  • Customer Reviews
  • Product Rank

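Before writing any parser, it helps to pin down a target schema for those fields. Below is a minimal sketch in Python; the record and field names are illustrative assumptions, not any particular site's schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Product:
    """One scraped product record; field names are hypothetical."""
    name: str
    price: float
    currency: str = "USD"
    features: List[str] = field(default_factory=list)
    product_type: Optional[str] = None
    manufacturer: Optional[str] = None
    brand: Optional[str] = None
    deals: List[str] = field(default_factory=list)
    description: Optional[str] = None
    company_description: Optional[str] = None
    reviews: List[str] = field(default_factory=list)
    rank: Optional[int] = None
```
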
Let’s talk about three large eCommerce companies’ websites: what you can scrape from their pages and what measures they take to limit the number of web harvesters.

Amazon

Indeed, it shouldn’t come as a surprise to anyone that Amazon is the first mention on this list. The reason is simple: it is one of the world’s largest eCommerce websites, with millions of products available. Its data can be used for a variety of purposes (mentioned above).

When scraping a large website like Amazon, the actual web scraping (making requests and parsing HTML) becomes a very minor part of your program. Instead, you spend a lot of time figuring out how to keep the entire crawl running smoothly and efficiently.
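
Much of that effort goes into retries, backoff, and politeness delays rather than parsing. Here is a minimal sketch of that orchestration layer, assuming the `requests` library; the user-agent string and the parsing hand-off are placeholders.

```python
import random
import time
from typing import List, Optional

import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}  # placeholder UA

def fetch(url: str, retries: int = 3, backoff: float = 2.0) -> Optional[str]:
    """Download one page, retrying with jittered exponential backoff."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=10)
            if resp.status_code == 200:
                return resp.text
            # Non-200 (throttling, server errors): fall through and retry
        except requests.RequestException:
            pass  # network hiccup: retry after the backoff below
        time.sleep(backoff ** attempt + random.random())
    return None  # give up on this URL and let the crawl move on

def crawl(urls: List[str]) -> None:
    for url in urls:
        html = fetch(url)
        if html:
            ...  # hand off to your parser; parsing is the easy part
        time.sleep(1 + random.random())  # politeness delay between requests
```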

Anyone who has ever tried to scrape data from Amazon knows that it’s not the easiest task.

To get the product you need, the scraper needs to dig very deep. The complexity of extracting data depends on the type of anti-scraping mechanisms employed by Amazon.

Although there are many application-level methods for blocking bots, Amazon seems to rely most of the time on IP-based CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart). What this means is that if you download too many pages from the same IP at very high speed, Amazon will serve you a CAPTCHA instead of the page you asked for.
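
In practice, a scraper therefore has to throttle its per-IP request rate and recognize when a CAPTCHA page comes back instead of a product page. A hedged sketch follows; the detection heuristic is an assumption, not a documented Amazon response format.

```python
import time
from typing import Optional

import requests

def looks_like_captcha(html: str) -> bool:
    # Heuristic only: the exact marker varies by site and changes over time.
    return "captcha" in html.lower()

def polite_get(session: requests.Session, url: str, delay: float = 3.0) -> Optional[str]:
    time.sleep(delay)  # keep the per-IP request rate low
    resp = session.get(url, timeout=10)
    if looks_like_captcha(resp.text):
        # Back off hard: slow the crawl further, or rotate to another IP/proxy
        return None
    return resp.text
```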

Walmart

Walmart is the largest retailer in the world, with over 10,000 stores globally and close to USD 500 billion in annual revenues. Yet its online business is a very small part of the whole. Since Walmart is nearing saturation in the U.S. on account of its expansive presence, it is prioritizing its ecommerce business to ensure steady revenue growth in the coming years. This in itself presents huge potential and attracts web-harvesting companies and individuals. When scraping Walmart for data, web harvesters are usually looking for:

  1. Product details that you can’t get with the Product Advertising API
  2. Monitoring an item for changes in price, stock count/availability, rating, etc. (see the sketch after this list)
  3. Analyzing how a particular brand is being sold on Walmart
  4. Analyzing Walmart marketplace sellers
  5. Analyzing Walmart product reviews
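
Item 2, for instance, usually boils down to polling a product page on a schedule and diffing the extracted fields. A minimal sketch under stated assumptions: `scrape_product()` is a hypothetical site-specific parser, and state is kept in memory.

```python
import time

def scrape_product(url: str) -> dict:
    # Placeholder: a real implementation would fetch and parse the page.
    return {"price": 19.99, "in_stock": True, "rating": 4.5}

def monitor(url: str, interval_s: int = 3600) -> None:
    """Poll a product page and report changes in price, stock, or rating."""
    previous: dict = {}
    while True:  # stop with Ctrl-C, or add a proper shutdown condition
        current = scrape_product(url)
        for key in ("price", "in_stock", "rating"):
            if key in previous and previous[key] != current.get(key):
                print(f"{key} changed: {previous[key]} -> {current.get(key)}")
        previous = current
        time.sleep(interval_s)
```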

As with Amazon, the Walmart website also uses systems focused specifically on the multifaceted problem of bots. Bot-detection techniques such as traffic analysis, fingerprinting of known bots, and machine learning that models expected visitor behaviour are all key aspects of the methodology Walmart uses to protect its data, and it does so very successfully.

Target

From 2001, Target.com was run in partnership with Amazon.com: the ecommerce giant’s platform powered the Target.com website, and Amazon.com handled much of the call-centre and fulfilment operations. But a lot has changed since 2001. Ecommerce has matured significantly, and Target wisely realized how important the multi-channel experience is, bringing its ecommerce operations in-house in 2011. From that point on it became another obvious target for web data harvesters, because flash sales and time-limited promotions act like candy to web scrapers. The methods scrapers use to harvest data from target.com are no different from those used on the two competitors detailed above, since the end result is much the same: gathering intelligence on pricing, sales, product features, product type, etc. For its part, target.com protects its data by rate limiting, detecting unusual activity, serving CAPTCHAs, and temporarily blocking IPs.

Web harvesters generally use bots to scrape pricing and product information from target.com; the aggregated data is then fed to an analytical engine, enabling competitors to match prices and products in close to real time, and seconds can make the difference between the retailer keeping a sale and a scraper “stealing” it. Besides, competitor bots often create hundreds of carts, add items to them, and then abandon them. Such malicious activity lowers the inventory level in real time and shows out-of-stock to genuine users who are willing to buy.
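
The “analytical engine” part can be as simple as joining your own catalogue against the scraped feed and flagging items where a competitor undercuts you. A toy sketch; the SKUs and prices below are made-up illustration data.

```python
# Both dicts map SKU -> price; a real pipeline would read from a database or feed.
our_prices = {"SKU-1": 24.99, "SKU-2": 9.99, "SKU-3": 104.99}
scraped_prices = {"SKU-1": 22.49, "SKU-2": 10.49, "SKU-3": 99.00}

for sku, ours in our_prices.items():
    theirs = scraped_prices.get(sku)
    if theirs is not None and theirs < ours:
        # Flag for repricing; a real engine would apply margin and rounding rules
        print(f"{sku}: competitor at {theirs:.2f}, we charge {ours:.2f}")
```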

To sum it up: in an ideal eCommerce world, every retailer would have an API that scrapers could communicate with. It would be instant, up to date, and able to support all levels of complexity. As one standard supported by all retailers, it would be intuitive and easy to use. In reality, though, eCommerce companies are very protective of their data for obvious reasons, and every day web scrapers come up with new ways of harvesting it, prompting websites to create newer ways of protecting it.