Using Mechanize, I would like to scrape information on this website => http://www.africanbookscollective.com

This is the information I would like to gather:

  • All Books listed under the category Fiction

Under this category, I want:

  1. Author name
  2. Book Title
  3. ISBN
  4. Publisher
  5. Country

I have figured out that this url => http://www.africanbookscollective.com/browse/african-literature/fiction gives me the information I want.

This is my current code:

require 'awesome_print'
require 'rubygems'
require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.africanbookscollective.com/browse/african-literature/fiction')
a = page.links.each do |link|
  puts link.text
end

ap a

This is my first time using Mechanize, so I am not exactly sure how it differs from Nokogiri. The main reason I am using it here is that I need to extract information across 38 pages (the complete list of books tagged Fiction).

ISSUES:

  1. I am getting a very long output from Mechanize that includes links I don't need.

  2. The information I need is not in a div class. It is in a dl class, and I have tried googling how to select a dl class but have not had any luck so far.

  3. Each time I perform a regex operation to remove the links I do not want, I get an empty array back.

Can someone, anyone, please help me think of a new way to approach this problem? I really would appreciate feedback.

PS: Here is an image that might shed some more light

Uzzar

1 Answer

You can use scrape4me.com to get the raw output for further processing in your (Mechanize) project. I don't know Mechanize, but maybe this can help. Good luck.

Youss