Scraping websites can get you valuable data, but it is often not straightforward. There are several challenges: creating requests (you need to learn how to code and use a library to create the HTTP requests that browsers make behind the scenes), setting the correct request headers (if you don’t set headers such as the language and encoding, a server may return a 403 error instead of the HTML that you want), throttling (a website may only allow a certain number of requests in a given amount of time so that you don’t bog down its server), and getting your IP banned (sometimes a website will try to prevent you from crawling by banning your IP address so you can’t make requests). We are going to show you how DataHen can handle all these difficult parts of scraping and make it simple for you to get any amount of data you want.
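To make the first three challenges concrete, here is a minimal sketch of what doing them by hand could look like with Ruby's standard library. The header values and the helper names build_request and throttle_delay are illustrative assumptions, not part of DataHen, and the request is only built, not sent.

```ruby
require 'net/http'
require 'uri'

# Browser-like headers; the exact values here are illustrative assumptions.
DEFAULT_HEADERS = {
  'User-Agent'      => 'Mozilla/5.0 (X11; Linux x86_64)',
  'Accept'          => 'text/html,application/xhtml+xml',
  'Accept-Language' => 'en-US,en;q=0.9',
  'Accept-Encoding' => 'identity'
}.freeze

# Builds (but does not send) a GET request that carries the given headers.
def build_request(url, headers = DEFAULT_HEADERS)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  headers.each { |name, value| request[name] = value }
  request
end

# Returns how many seconds to wait so that two consecutive requests are
# at least min_interval seconds apart. Times are given in seconds.
def throttle_delay(last_request_at, now, min_interval)
  return 0.0 if last_request_at.nil?
  [min_interval - (now - last_request_at), 0.0].max
end

# Sending the request would look roughly like this (not executed here):
#   uri = URI.parse('https://www.aliexpress.com/')
#   Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
#     http.request(build_request(uri.to_s))
#   end
```

Even this small sketch leaves out retries, IP rotation, and ban handling, which is the kind of work the rest of this tutorial hands over to DataHen.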
If you prefer to skip this tutorial, you can clone this script directly here.
For this tutorial, we are going to show you how to use DataHen to easily scrape information from categories on AliExpress (https://www.aliexpress.com/). Specifically, we are going to scrape the Women’s clothing category and extract the following data (also highlighted below): title, image URL, discounted price, original price, product categories, SKUs, rating, number of reviews, number of orders, shipping info, return policy, and the guarantee.
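As a rough picture of the goal, the fields listed above could be collected into one record per product. The sketch below uses hypothetical field names chosen for illustration; the names used by the actual scraper may differ.

```ruby
# Hypothetical field names, one per item in the list above.
PRODUCT_FIELDS = %i[
  title image_url discounted_price original_price categories skus
  rating reviews_count orders_count shipping_info return_policy guarantee
].freeze

# Returns a record containing every expected field, each set to nil.
def empty_product
  PRODUCT_FIELDS.map { |field| [field, nil] }.to_h
end

# Lists the fields that a scraped record has not filled in yet.
def missing_fields(product)
  PRODUCT_FIELDS.select { |field| product[field].nil? }
end
```

A helper like missing_fields can be handy later for spotting product pages where a selector failed to match.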
We are going to assume you have Ruby 2.5.3 and the Nokogiri gem installed. If not, follow this link for instructions on how to install Ruby. Once Ruby is installed, make sure RubyGems is also installed, and then run the following to install Nokogiri:
$ gem install nokogiri
First let’s set up a new DataHen scraper. Install the DataHen Ruby gem with the following command:
$ gem install datahen --source https://[email protected]/datahen/
You should see something similar to the following output after running this command:
Successfully installed datahen-0.2.3
Parsing documentation for datahen-0.2.3
Done installing documentation for datahen after 0 seconds
1 gem installed
Now that we have the DataHen gem installed, we need to store our DataHen token in an environment variable so that it is sent with every DataHen request. Run the following command, replacing the placeholder with your own token:
$ export DATAHEN_TOKEN=<your_token_Here>
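If you want to confirm that the variable is visible to Ruby processes started from the same shell, a small check like the following could be used. This is only a sketch: datahen_token is a hypothetical helper written for this tutorial, not part of the DataHen gem.

```ruby
# Returns the token from the given environment, or raises a readable
# error when it is missing or blank. Accepts ENV or any Hash-like object.
def datahen_token(env = ENV)
  token = env['DATAHEN_TOKEN'].to_s.strip
  raise 'DATAHEN_TOKEN is not set; run the export command first' if token.empty?
  token
end

# Example usage (not executed here):
#   puts "Token found (#{datahen_token.length} characters)"
```

Keep in mind that export only affects the current shell session, so the command needs to be run again in each new terminal unless you add it to your shell profile.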
We are now ready to create a scraper. Let’s create an empty directory first, and name it ‘ali-express’:
$ mkdir ali-express
Next let’s go into the directory and initialize it as a Git repository:
$ cd ali-express
$ git init .
Now that you are all set up, let's move on to creating the seeders in Part II.