Every DataHen scraper requires a seeder script, which tells the scraper which pages to start scraping from. A seeder script is a Ruby file that loads URLs into a variable called “pages.” First, create a directory for our seeder script:

$ mkdir seeder

Next, create a file called “seeder.rb” inside this seeder directory with the following code:

pages << {
  page_type: "listings",
  method: "GET",
  headers: {"User-Agent" => "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"},
  url: "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
  fetch_type: "browser"
}

In the Ruby script above, we are seeding a link to the most recent movie releases on Walmart within the last 90 days. Let’s go through the values in detail, starting with “pages” itself: it is a reserved variable, an array representing the pages you want to seed.
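Because “pages” is an ordinary Ruby array, you can push as many seed pages as you like. Here is a minimal sketch that seeds two listing pages at once (the second URL is a hypothetical placeholder, purely for illustration):

category_urls = [
  "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
  "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+30+Days" # hypothetical
]

category_urls.each do |url|
  pages << {
    page_type: "listings",
    method: "GET",
    url: url,
    fetch_type: "browser"
  }
end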

The “page_type” setting determines which parser script will process the page. Later we will create a Ruby parser script called “listings” to match this value.

The “method” is the type of HTTP request we want to make. In this example, we are making a simple “GET” request, which is what your browser would make if you were viewing this URL.
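GET is all we need in this tutorial, but seeds are not limited to it. As a rough sketch only, and assuming your version of DataHen accepts a request body on seeded pages (check the DataHen documentation; the page type, URL, and payload below are all hypothetical):

pages << {
  page_type: "search_results",       # hypothetical page type
  method: "POST",
  headers: {"Content-Type" => "application/json"},
  body: '{"query": "new releases"}', # assumes DataHen supports a "body" field
  url: "https://www.example.com/api/search"
}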

For the “headers” setting, we are setting a “User-Agent,” a string that identifies a browser. Whenever you access a website, your browser includes a “User-Agent” header so the website knows how to render the page you request. By including a “User-Agent” string, we reduce the chance of the Walmart website deciding we are a scraping bot and blocking our requests. You can also leave the “headers” setting out completely, and DataHen will randomly select a “User-Agent” to submit with each page request. The randomly selected “User-Agents” are all valid strings from the major browsers (Chrome, Firefox, and Internet Explorer), so there is no need to worry if you leave this out.
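This means the simplest version of our seed could drop the “headers” key entirely and let DataHen submit a random, valid “User-Agent” on our behalf:

pages << {
  page_type: "listings",
  method: "GET",
  url: "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
  fetch_type: "browser"
  # no "headers" key: DataHen randomly selects a valid User-Agent for us
}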

The “fetch_type” value is set to “browser,” which tells DataHen to use a headless browser for the request. A headless browser simulates a real browser and will even execute the JavaScript on the page. Walmart uses JavaScript to render its pages, so if we don’t use the “browser” fetch type we won’t be able to retrieve the data we want.
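For comparison, a site that renders its HTML on the server would not need a headless browser. A sketch (the URL is a hypothetical placeholder, and we assume DataHen’s default, non-browser fetch is sufficient when no JavaScript rendering is needed):

pages << {
  page_type: "listings",
  method: "GET",
  url: "https://www.example.com/server-rendered-catalog" # hypothetical
  # no "fetch_type" key: the default fetch skips the headless browser,
  # which is faster when the page does not rely on JavaScript
}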

Now that we have created a seeder script, we can try it out to see if there are any syntax errors. Run the following command from the root of your project directory:

$ datahen seeder try seeder/seeder.rb 

You should see the following output:

Trying seeder script
=========== Seeding Script Executed ===========
----------- New Pages to Enqueue: -----------
[
  {
    "page_type": "listings",
    "method": "GET",
    "headers": {
      "User-Agent": "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    },
    "url": "https://www.walmart.com/browse/movies-tv-shows/4096?facet=new_releases:Last+90+Days",
    "fetch_type": "browser",
    "force_fetch": true
  }
]

Now we can commit this seeder to our git repository with the following commands:

$ git add .
$ git commit -m 'created a seeder file'

DataHen scrapers live in git repositories, so we will need a remote repository to deploy from. Bitbucket offers free git repositories: create a Bitbucket account and then a new repository here: https://bitbucket.org/repo/create. Use the git repo address from Bitbucket and push your scraper with the following commands (replace <username> with your Bitbucket username):

$ git remote add origin git@bitbucket.org:<username>/walmart-movies.git
$ git push -u origin master

We will need a config file to tell DataHen where to find our files. Create a config.yaml file in the root project directory with the following content:

seeder:
 file: ./seeder/seeder.rb
 disabled: false # Optional. Set it to true if you want to disable execution of this file

Commit this config file with git, and push it to Bitbucket:

$ git add .
$ git commit -m 'add config.yaml file'
$ git push origin master  

We can now create a scraper and run it on DataHen. Replace the git repo address in the following command with your own (it should end in .git). This command creates a scraper called “walmart-movies”:

$ datahen scraper create walmart-movies git@bitbucket.org:<username>/walmart-movies.git --workers 1

Next, we need to deploy the code from our remote Git repository onto DataHen:

$ datahen scraper deploy walmart-movies  

After deploying we can start the scraper with the following command:

$ datahen scraper start walmart-movies     

Starting a new scraper will create a new scrape job and run it. Wait a minute and then check the status of this job with the following command:

$ datahen scraper stats walmart-movies 

You should see something similar to the following:

{
 "job_id": 70,             # Job ID
 "pages": 1,               # How many pages in the scrape job
 "fetched_pages": 1,       # Number of fetched pages
 "to_fetch": 0,            # Pages that needs to be fetched
 "fetching_failed": 0,     # Pages that failed fetching
 "fetched_from_web": 1,    # Pages that were fetched from Web
 "fetched_from_cache": 0,  # Pages that were fetched from the shared Cache
 "parsed_pages": 0,        # Pages that have been parsed by parsing script
 "to_parse": 1,            # Pages that needs to be parsed
 "parsing_failed": 0,      # Pages that failed parsing
 "outputs": 0,             # Outputs of the scrape
 "output_collections": 0,  # Output collections
 "workers": 1,             # How many workers are used in this scrape job
 "time_stamp": "2019-02-01T22:09:57.956158Z"
}

The “fetched_pages” value of 1 indicates that our scraper has successfully fetched the page we seeded. Next, in Part III, we will look at creating parsers to extract data.