Using Mechanize, I would like to scrape information from this website: http://www.africanbookscollective.com
This is the information I would like to gather:
- All Books listed under the category Fiction
Under this category, I want:
- Author name
- Book Title
- ISBN
- Publisher
- Country
I have figured out that this URL gives me the information I want: http://www.africanbookscollective.com/browse/african-literature/fiction
This is my current code:
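require 'rubygems'
require 'mechanize'
require 'awesome_print'

agent = Mechanize.new
page = agent.get('http://www.africanbookscollective.com/browse/african-literature/fiction')

# Print the text of every link on the page.
# `each` returns the array it iterated over, so `links` holds all link objects.
links = page.links.each do |link|
  puts link.text
end

# Pretty-print the whole array of link objects
ap links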
This is my first time using Mechanize, so I am not sure exactly how it differs from Nokogiri. The main reason I am using it here is that I need to extract information across 38 pages (the complete list of books tagged Fiction).
ISSUES:
I am getting a very long output from Mechanize that includes links I don't need.
The information I need is not in a div with a class; it is in a dl element with a class. I have tried googling for how to select a dl by its class but have not had any luck so far.
Each time I have run a regex to remove the links I do not want, I get an empty array back.
Can someone, anyone, please help me think of a new way to approach this problem? I really would appreciate feedback.
PS: Here is an image that might shed some more light