The code below works, but it does not advance to the next page of results. I have figured out that the next-page link sits at two different XPath locations depending on the page, and I'm unsure how to handle that in code.
In response to a comment, here is the source around the selectors in question for page one:
<table class="pager" cellspacing="0">
<tr>
<td>
Items 1 to 72 of 1146 total </td>
<td class="pages">
<strong>Page:</strong>
<ol>
<li><span class="on">1</span></li>
<li><a href="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position&p=2">2</a></li>
<li><a href="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position&p=3">3</a></li>
<li><a href="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position&p=4">4</a></li>
<li><a href="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position&p=5">5</a></li>
<li><a href="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position&p=2"><img src="http://www.example.com/skin/frontend/default-mongo/a033/images/pager_arrow_right.gif" alt="Next Page"/></a></li>
</ol>
</td>
<td class="a-right">
Show <select onchange="setLocation(this.value)">
<option value="http://www.example.com/clothing-accessories?dir=asc&limit=12&order=position">
12 </option>
<option value="http://www.example.com/clothing-accessories?dir=asc&limit=24&order=position">
24 </option>
<option value="http://www.example.com/clothing-accessories?dir=asc&limit=48&order=position">
48 </option>
<option value="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position" selected="selected">
72 </option>
</select> per page </td>
</tr>
</table>
And here is the same section of the source on all subsequent pages (page two shown):
<table class="pager" cellspacing="0">
<tr>
<td>
Items 73 to 144 of 1146 total </td>
<td class="pages">
<strong>Page:</strong>
<ol>
<li><a href="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position&p=1"><img src="http://www.example.com/skin/frontend/default-mongo/a033/images/pager_arrow_left.gif" alt="Previous Page" /></a></li>
<li><a href="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position&p=1">1</a></li>
<li><span class="on">2</span></li>
<li><a href="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position&p=3">3</a></li>
<li><a href="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position&p=4">4</a></li>
<li><a href="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position&p=5">5</a></li>
<li><a href="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position&p=3"><img src="http://www.example.com/skin/frontend/default-mongo/a033/images/pager_arrow_right.gif" alt="Next Page"/></a></li>
</ol>
</td>
<td class="a-right">
Show <select onchange="setLocation(this.value)">
<option value="http://www.example.com/clothing-accessories?dir=asc&limit=12&order=position&p=2">
12 </option>
<option value="http://www.example.com/clothing-accessories?dir=asc&limit=24&order=position&p=2">
24 </option>
<option value="http://www.example.com/clothing-accessories?dir=asc&limit=48&order=position&p=2">
48 </option>
<option value="http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position&p=2" selected="selected">
72 </option>
</select> per page </td>
</tr>
</table>
On the first page of results the next page link is defined by the XPath selector:
//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[6]/a
On all subsequent pages the next page link is defined by:
//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[7]/a
What part of the code should I change, and how, to ensure that the program iterates to the next page of results regardless of how the next-page link is defined?
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'fileutils'

DATA_DIR = "data-hold/clothing-accessories"
# mkdir_p also creates the parent "data-hold" directory if it is missing
FileUtils.mkdir_p(DATA_DIR) unless File.exist?(DATA_DIR)

BASE_TOM_URL = "http://www.example.com"
list_url = "#{ BASE_TOM_URL }/clothing-accessories?dir=asc&limit=72&order=position"

loop do
  # URI.open is the open-uri entry point on current Ruby versions
  page = Nokogiri::HTML(URI.open(list_url))
  rows = page.xpath('//*[@id="product-list-table"]/li')
  unless rows.empty?
    rows[1..-2].each do |row|
      # NOTE: this XPath starts with //, so it searches the whole document, not just this row
      hrefs = row.xpath('//*[@id="product-list-table"]/li/div/a').map { |a| a['href'] }.uniq
      hrefs.each do |href|
        remote_url = href
        local_fname = "#{ DATA_DIR }/#{ File.basename(href) }"
        # Skip products that were already downloaded
        unless File.exist?(local_fname)
          puts "Fetching #{ remote_url }..."
          begin
            tom_content = URI.open(remote_url).read
            File.write(local_fname, tom_content)
            puts "\t...Success, saved to #{ local_fname }"
            sleep 1.0 + rand
          rescue StandardError => e
            puts "Error: #{ e }"
            sleep 5
          end
        end
      end
    end
  end

  # This is the selector in question: it only describes the next-page link on pages after the first
  next_results_link = page.at('//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[7]/a')
  if next_results_link
    list_url = next_results_link['href']
    puts "\t...Getting next page of results: #{list_url}"
  else
    break
  end
end