0

I'm looking for assistance on the best way to loop through successive pages on a website while scraping relevant data off of each page.

For example, I want to go to a specific site (craigslist in below example), scrape the data from the first page, go to the next page, scrape all relevant data, etc, until the very last page.

In my script I'm using a while loop since it seemed to make the most sense to me. However, it doesn't appear to be working properly and is only scraping data from the first page.

Can someone familiar with Ruby/Mechanize point me in the right direction on what the best way to accomplish this task is. I've spent countless hours trying to figure this out and feel like I'm missing something very basic.

Thanks in advance for your help.

require 'mechanize'
require 'pry'

# initialze
agent = Mechanize.new { |agent| agent.user_agent_alias = 'Mac Safari'}
url = "http://charlotte.craigslist.org/search/rea"
page = agent.get(url)

# Create an empty array to dump contents into
property_results = []

# Scrape all successive pages from craigslist
while page.link_with(:dom_class => "button next") != nil
    next_link = page.link_with(:dom_class => "button next")  
    page.css('ul.rows').map do |d|  
        property_hash = { title: d.at_css('a.result-title.hdrlnk').text }    
        property_results.push(property_hash)    
    end  
    page = next_link.click
end 

UPDATE: I found this, but still no dice:

Ruby Mechanize: Follow a Link

@pguardiario

require 'mechanize'
require 'httparty'
require 'pry'

# initialze
agent = Mechanize.new 
url = "http://charlotte.craigslist.org/search/rea"
page = agent.get(url)

#create Empty Array
property_results = []

# Scrape all successive pages from craigslist
while link = page.at('[rel=next]')
  page.css('ul.rows').map do |d|  
    property_hash = { title: d.at_css('a.result-title.hdrlnk').text }    
    property_results.push(property_hash)
  end
    link = page.at('[rel=next]')
    page = agent.get link[:href]
 end
pry(binding)
Community
  • 1
  • 1
pjw23
  • 73
  • 2
  • 7

1 Answers1

1

Whenever you see a [rel=next], that's the thing you want to follow:

page = agent.get url
do_something_with page
while link = page.at('[rel=next]')
  page = agent.get link[:href]
  do_something_with page
end
pguardiario
  • 53,827
  • 19
  • 119
  • 159
  • THANK YOU SO MUCH FOR YOUR INPUT......I'm getting very close to making this work, however I still have one last hurdle. It appears that my results are coming back as multiple arrays of hashes vs a single array of hashes. I added detail to the original post in hopes you may know how to correct this. Thanks again for your help. – pjw23 Dec 01 '16 at 03:02
  • This is really a separate question. I don't think I should try to answer it here. – pguardiario Dec 01 '16 at 05:04
  • I swear I had it working (at least going to next page) yesterday, but running it today, it's not working again. Can you check/ run the updated code in the original post and let me know what I'm missing. THanks again for your help. – pjw23 Dec 01 '16 at 19:35
  • Do it the way I showed you with do_something_with page a separate method. – pguardiario Dec 01 '16 at 22:42