The next step is to create a script that we will use to parse out product details such as titles, prices, ratings, etc. Create a folder called “parsers” in our project root directory:

$ mkdir parsers

Next, create a file called “products.rb” inside this “parsers” folder. Since we set the “page_type” to “products” in our seeder, the pages created by our seeder, and any other pages with this “page_type,” will be run against this “products” parser, allowing us to extract and save product details. First, add the following line to the top of this “products.rb” file:

nokogiri = Nokogiri.HTML(content)

The “content” variable is a reserved word that contains the HTML content of the actual page. With this line we load the HTML into Nokogiri so that we can search it easily. Next we are going to start extracting details from this HTML. Copy and paste the following below the line you just created:

# initialize an empty hash
product = {}

#save the url
product['url'] = page['vars']['url']

#save the asin
product['asin'] = page['vars']['asin']

#extract title
product['title'] = nokogiri.at_css('#productTitle').text.strip

#extract seller/author
seller_node = nokogiri.at_css('a#bylineInfo')
if seller_node
  product['seller'] = seller_node.text.strip
else
  product['author'] = nokogiri.css('a.contributorNameID').text.strip
end

#extract number of reviews
reviews_node = nokogiri.at_css('span#acrCustomerReviewText')
reviews_count = reviews_node ? reviews_node.text.strip.split(' ').first.gsub(',','') : nil
product['reviews_count'] = reviews_count =~ /^[0-9]*$/ ? reviews_count.to_i : 0

#extract rating
rating_node = nokogiri.at_css('#averageCustomerReviews span.a-icon-alt')
stars_num = rating_node ? rating_node.text.strip.split(' ').first : nil
product['rating'] = stars_num =~ /^[0-9.]*$/ ? stars_num.to_f : nil

#extract price
product['price'] = nokogiri.at_css('#price_inside_buybox', '#priceblock_ourprice', '#priceblock_dealprice', '.offer-price', '#priceblock_snsprice_Based').text.strip.gsub(/[\$,]/,'').to_f

#extract availability
availability_node = nokogiri.at_css('#availability')
if availability_node
  product['available'] = availability_node.text.strip == 'In Stock.' ? true : false
else
  product['available'] = nil
end

#extract product description
description = ''
nokogiri.css('#feature-bullets li').each do |li|
  unless li['id'] || (li['class'] && li['class'] != 'showHiddenFeatureBullets')
    description += li.text.strip + ' '
  end
end
product['description'] = description.strip

#extract image
product['image'] = nokogiri.at_css('#main-image-container img')['src']

# specify the collection where this record will be stored
product['_collection'] = "products"

# save the product to the job’s outputs
outputs << product

Let’s go through this code line by line:

product = {}

First we initialize an empty hash. This is where we will store the data that we extract.

product['url'] = page['vars']['url']
product['asin'] = page['vars']['asin']

Next we are saving the “url” and “asin” values from the page variable that we set in the seeder file.
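As a reminder, the “vars” hash we are reading here comes from the page we enqueued in the seeder. A seeded page looked roughly like the following sketch (the ASIN below is just a placeholder):

# roughly what the seeder enqueued; the ASIN value is only a placeholder
pages << {
  'url' => 'https://www.amazon.com/dp/B01N5IB20Q',
  'page_type' => 'products',
  'vars' => {
    'asin' => 'B01N5IB20Q',
    'url' => 'https://www.amazon.com/dp/B01N5IB20Q'
  }
}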

product['title'] = nokogiri.at_css('#productTitle').text.strip

After that we extract the product title. This line says: give us the HTML element with the id “productTitle,” then extract just the text inside it and strip any leading and trailing whitespace.

seller_node = nokogiri.at_css('a#bylineInfo')
if seller_node
  product['seller'] = seller_node.text.strip
else
  product['author'] = nokogiri.css('a.contributorNameID').text.strip
end

Next we extract either the seller or the author. We first look for a link with the id “bylineInfo” and use an “if” statement to check whether the element exists. If Nokogiri cannot find the element it returns nil, which evaluates as false, so if the element exists we know a seller is present and we save the text of the seller link to our product hash. If there is no seller, the product is most likely a book with an author name instead, so we look for a link with the class “contributorNameID” and save its text as the author.
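You can see this nil behaviour in a small standalone snippet; the HTML fragment below is made up and is not a real Amazon page:

require 'nokogiri'

doc = Nokogiri.HTML('<a id="bylineInfo" href="#">Acme Store</a>')

doc.at_css('a#bylineInfo')         # => matching node (truthy), so 'seller' is saved
doc.at_css('a.contributorNameID')  # => nil (falsy), so the else branch runs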

reviews_node = nokogiri.at_css('span#acrCustomerReviewText')
reviews_count = reviews_node ? reviews_node.text.strip.split(' ').first.gsub(',','') : nil
product['reviews_count'] = reviews_count =~ /^[0-9]*$/ ? reviews_count.to_i : 0

For these lines we extract the number of reviews a product has. We look for a “span” element with the id “acrCustomerReviewText.” If the “span” exists we get its inner text, which should look something like “300 customer reviews.” To get just the number we split the string on whitespace, take the first element of the resulting array and remove any commas. We then use a regular expression to check that the value really is a number; if it is, we convert it to an integer and save it, otherwise we save 0.
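To make that concrete, here is what each step produces for a typical (illustrative) value:

text = '300 customer reviews'
text.split(' ')            # => ["300", "customer", "reviews"]
text.split(' ').first      # => "300"
'1,234'.gsub(',', '')      # => "1234"  (commas removed for larger counts)
'300' =~ /^[0-9]*$/        # => 0 (a match), so to_i turns it into 300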

rating_node = nokogiri.at_css('#averageCustomerReviews span.a-icon-alt')
stars_num = rating_node ? rating_node.text.strip.split(' ').first : nil
product['rating'] = stars_num =~ /^[0-9.]*$/ ? stars_num.to_f : nil

Next we extract the star rating, which works much like extracting the number of reviews. We look for a “span” element with the class “a-icon-alt” inside the element with the id “averageCustomerReviews.” If it exists we get its inner text, split it on whitespace and take the first element. If that value is a number we convert it to a floating-point number, so the decimal part of the rating is preserved, and save it.
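Again, a quick illustration with a typical (made-up) rating string:

text = '4.5 out of 5 stars'
text.split(' ').first      # => "4.5"
'4.5' =~ /^[0-9.]*$/       # => 0 (a match), so to_f turns it into 4.5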

product['price'] = nokogiri.at_css('#price_inside_buybox', '#priceblock_ourprice', '#priceblock_dealprice', '.offer-price', '#priceblock_snsprice_Based').text.strip.gsub(/[\$,]/,'').to_f

With this line we extract the price. Amazon has a number of different layouts depending on the type of product, so we have to try a number of different ids and classes. Once we have the Nokogiri element, we get its inner text, remove any dollar signs and commas, and convert the result to a floating-point number.
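One caveat: if none of those selectors match, “at_css” returns nil and calling “text” on nil raises an error for that page. If you would rather save a nil price than raise, an optional, more defensive variant could look like this:

# optional defensive variant: saves nil instead of raising when no price element is found
price_node = nokogiri.at_css('#price_inside_buybox', '#priceblock_ourprice',
                             '#priceblock_dealprice', '.offer-price',
                             '#priceblock_snsprice_Based')
product['price'] = price_node ? price_node.text.strip.gsub(/[\$,]/, '').to_f : nil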

availability_node = nokogiri.at_css('#availability')
if availability_node
  product['available'] = availability_node.text.strip == 'In Stock.' ? true : false
else
  product['available'] = nil
end

Here we check whether a product is available. To do this we look for a “div” with the id “availability” and, if it exists, check whether its inner text equals “In Stock.” If the element does not exist we save nil instead.

description = ''
nokogiri.css('#feature-bullets li').each do |li|
  unless li['id'] || (li['class'] && li['class'] != 'showHiddenFeatureBullets')
    description += li.text.strip + ' '
  end
end
product['description'] = description.strip

After that we save the product description. The description can be found as bullet points inside a “div” with the id “feature-bullets,” so we grab the list elements inside this div and iterate through each one. We use “unless” to skip any list element that has an “id,” as well as any list element whose “class” is something other than “showHiddenFeatureBullets.” The reason is that there are often list elements with ids and classes that are not relevant to the product description. Each remaining list element is appended to a description string, which we then save to our product hash.
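Here is the same filter run against a made-up HTML fragment, to show which bullets survive (the ids, classes and text below are invented for illustration):

require 'nokogiri'

html = <<~HTML
  <div id="feature-bullets">
    <ul>
      <li>Durable stainless steel body</li>
      <li class="showHiddenFeatureBullets">Dishwasher safe</li>
      <li id="somePromoBullet" class="aok-hidden">See more promotions</li>
    </ul>
  </div>
HTML

Nokogiri.HTML(html).css('#feature-bullets li').each do |li|
  next if li['id'] || (li['class'] && li['class'] != 'showHiddenFeatureBullets')
  puts li.text.strip
end
# prints:
#   Durable stainless steel body
#   Dishwasher safe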

#extract image
product['image'] = nokogiri.at_css('#main-image-container img')['src']

The product image is pretty straightforward. We look for the “img” element inside a “div” with the id “main-image-container” and save its “src” attribute, which is the image URL, to our product hash.

product['_collection'] = 'products'

This line sets the “collection” name to “products.” Job outputs are stored in collections and specifying the collection will allow us to query and export the data later.

outputs << product

Finally, we push the product hash onto the “outputs” variable, an array that holds the job’s output records.
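Put together, a single record pushed onto “outputs” will look roughly like this (all values below are illustrative):

{
  'url' => 'https://www.amazon.com/dp/B01N5IB20Q',
  'asin' => 'B01N5IB20Q',
  'title' => 'Example Product Title',
  'seller' => 'Acme Store',
  'reviews_count' => 300,
  'rating' => 4.5,
  'price' => 19.99,
  'available' => true,
  'description' => 'Durable stainless steel body Dishwasher safe',
  'image' => 'https://example.com/images/example-product.jpg',
  '_collection' => 'products'
}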

Now we can update our config.yaml file by specifying our products parser. The config.yaml file should look like the following:

seeder:
 file: ./seeder/seeder.rb
parsers:
 - page_type: products
   file: ./parsers/products.rb

Commit this to Git, and push it to your remote Git repository.

$ git add .
$ git commit -m 'add products parser to config'
$ git push origin master

Now that we have pushed our parser to our git repository, we can deploy the scraper again:

$ datahen scraper deploy amazon-asins

DataHen will automatically download this new parser and start parsing all the pages with “page_type” set to “products.” You can keep running the “stats” command from earlier to check on the progress. Once the scraper has finished parsing all the pages, we can export and download the data. We will learn more about exporting in Part IV.