Workaround for page that uses multiple XPath selectors to define a single link?

Question

The following code works, but it will not advance to the next page. I have figured out that the website in question uses two different XPath selectors for the next page link, and I'm unsure how to handle that in code.

In response to the comment, here is the source around the selectors in question for page one:

<table class="pager" cellspacing="0">
    <tr>
        <td>
                    Items 1 to 72 of 1146 total                </td>
                <td class="pages">
            <strong>Page:</strong>
            <ol>
                                                            <li><span class="on">1</span></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=2">2</a></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=3">3</a></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=4">4</a></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=5">5</a></li>
                                                        <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=2"><img src="http://www.example.com/skin/frontend/default-mongo/a033/images/pager_arrow_right.gif" alt="Next Page"/></a></li>
                        </ol>
        </td>

        <td class="a-right">
            Show <select onchange="setLocation(this.value)">
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=12&amp;order=position">
                    12                </option>
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=24&amp;order=position">
                    24                </option>
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=48&amp;order=position">
                    48                </option>
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position" selected="selected">
                    72                </option>
                        </select> per page        </td>

    </tr>
</table>

And here is the same part of the source on page two, which stands in for all subsequent pages:

<table class="pager" cellspacing="0">
    <tr>
        <td>
                    Items 73 to 144 of 1146 total                </td>
                <td class="pages">
            <strong>Page:</strong>
            <ol>
                            <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=1"><img src="http://www.example.com/skin/frontend/default-mongo/a033/images/pager_arrow_left.gif" alt="Previous Page" /></a></li>
                                                            <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=1">1</a></li>
                                                                <li><span class="on">2</span></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=3">3</a></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=4">4</a></li>
                                                                <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=5">5</a></li>
                                                        <li><a href="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=3"><img src="http://www.example.com/skin/frontend/default-mongo/a033/images/pager_arrow_right.gif" alt="Next Page"/></a></li>
                        </ol>
        </td>

        <td class="a-right">
            Show <select onchange="setLocation(this.value)">
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=12&amp;order=position&amp;p=2">
                    12                </option>
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=24&amp;order=position&amp;p=2">
                    24                </option>
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=48&amp;order=position&amp;p=2">
                    48                </option>
                            <option value="http://www.example.com/clothing-accessories?dir=asc&amp;limit=72&amp;order=position&amp;p=2" selected="selected">
                    72                </option>
                        </select> per page        </td>

    </tr>
</table>

On the first page of results the next page link is defined by the XPath selector:

//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[6]/‌a

On all subsequent pages the next page link is defined by:

//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[7]/‌a

What part of the code should I change, and how, to ensure that the program iterates to the next page of results regardless of how next_results_link is defined?

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'fileutils'

DATA_DIR = "data-hold/clothing-accessories"
Dir.mkdir(DATA_DIR) unless File.exist?(DATA_DIR)
BASE_TOM_URL = "http://www.example.com"

list_url = "#{ BASE_TOM_URL }/clothing-accessories?dir=asc&limit=72&order=position"

loop do

  page = Nokogiri::HTML(URI.open(list_url)) # URI.open: Kernel#open no longer fetches URLs as of Ruby 3.0
  rows = page.xpath('//*[@id="product-list-table"]/li')

  unless rows.empty?

    rows[1..-2].each do |row|

      hrefs = row.xpath('//*[@id="product-list-table"]/li/div/a').map{ |a| a['href'] }.uniq

      hrefs.each do |href|

        remote_url = href
        local_fname = "#{ DATA_DIR }/#{ File.basename(href) }"

        unless File.exist?(local_fname)

          puts "Fetching #{ remote_url }..."

          begin
            tom_content = URI.open(remote_url).read
            File.write(local_fname, tom_content)
            puts "\t...Success, saved to #{ local_fname }"
            sleep 1.0 + rand
          rescue StandardError => e
            puts "Error: #{ e }"
            sleep 5
          end  

        end 

      end 

    end

  end


  next_results_link = page.at('//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[7]/a')

  if next_results_link
    list_url = next_results_link['href']
    puts "\t...Getting next page of results: #{list_url}"
  else
    break
  end

end

You should provide some sample input around that next link or URIs, so we can identify a pattern to match on. By the way, **never use any domain names other than example.{com,net,org,edu} for providing example domains**. They're defined especially for this purpose, and all others will probably belong to others and confuse readers. — Jens Erat, Dec 01 '13 at 12:45
Thanks for the input Jens, I added some relevant source data. The Tin Man was able to provide an alternative using each_with_index, but I wasn't able to nest that properly within my loops. I also tried using the CSS selector next_results_link = page.at("li a img"), but this threw the error "can't convert nil into string". Thanks in advance for your assistance. — jcuwaz, Dec 01 '13 at 17:15
Thanks for posting more details, this is a very well-posted question now. XPath predicates are much more powerful than you seem to know (until now): It's very easy to match the image contained in that link. — Jens Erat, Dec 01 '13 at 19:08

score 1 · Accepted Answer · edited May 23 '17 at 10:31

That link contains an image with the alternative text "Next Page". Take advantage of this:

//td[contains(@class, 'pages')]/ol/li/a[img/@alt='Next Page']

If you prefer a complete path, you can easily attach the final step of this XPath expression, a[img/@alt='Next Page'], to the beginning of the path you fetched above. I'd even go further and use //td[contains(@class, 'pages')]//a[img/@alt='Next Page'] to further decouple your code from the XML structure.

For matching class attributes you should also consider using a more correct version, but it makes the expression a little bit more complicated. Have a look at this question on matching XML classes.

Works smoothly again. Thanks so much Jens. — jcuwaz, Dec 01 '13 at 22:00

the Tin Man · Answer 2 · 2013-12-03T04:23:14.720

Why don't you do something like:

rows[1..-2].each_with_index do |row, i|

  ...

  xpath_index = if i == 1
    '6'
  else
    '7'
  end

  next_results_link = page.at(%Q!//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[#{ xpath_index }]/a!)
  ...

end

This will give you an idea what it's doing:

xpath_index = 6
%Q!//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[#{ xpath_index }]/a!
# => "//*[@id=\"bodyblock\"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[6]/a"

xpath_index = 7
%Q!//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[#{ xpath_index }]/a!
# => "//*[@id=\"bodyblock\"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol/li[7]/a"
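
One way to read the suggestion above is to key the index on the results page rather than on the row, since the question reports li[6] on the first page and li[7] on later ones. The sketch below assumes that reading; PAGER_PREFIX, next_link_xpath and the page_number argument are hypothetical names, and the caller would have to increment its own page counter on each pass of the outer loop.

```ruby
# Path to the pager's <ol>, copied from the question's selector.
PAGER_PREFIX = '//*[@id="bodyblock"]/div/div[2]/div[2]/div[3]/table[3]/tbody/tr/td[2]/ol'

# Hypothetical helper: pick the li index from the results-page number
# (6 on the first page, 7 afterwards, as reported in the question).
def next_link_xpath(page_number)
  li_index = page_number == 1 ? 6 : 7
  %Q!#{ PAGER_PREFIX }/li[#{ li_index }]/a!
end

puts next_link_xpath(1)
puts next_link_xpath(2)
```

This stays tied to the pager's exact layout, so it only holds while the list keeps the shape shown in the question.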

Also, just so you know, you have non-ASCII characters in your XPath: a zero-width non-joiner (U+200C) and a zero-width space (U+200B) sit between the / and the a. How they got there I don't know, but the trailing /a isn't valid. It currently is:

'/‌a'.codepoints.to_a # => [47, 8204, 8203, 97]

And should be:

'/a'.codepoints.to_a # => [47, 97]
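
If a selector has been pasted from a web page, stripping those two characters before use is one option. This is a small sketch with a hypothetical pasted string, written with \u escapes so the invisible characters show up in the source:

```ruby
# Hypothetical tail of a pasted selector: "/", U+200C, U+200B, "a".
pasted = "li[6]/\u{200C}\u{200B}a"

# Remove the zero-width space (U+200B) and zero-width non-joiner (U+200C).
cleaned = pasted.gsub(/[\u200B\u200C]/, '')

p pasted.codepoints.last(4)  # => [47, 8204, 8203, 97]
p cleaned.codepoints.last(2) # => [47, 97]
p cleaned.ascii_only?        # => true
```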

"page.at(%Q!" selector syntax is new to me and I haven't seen it referenced in any of my readings

at is Nokogiri's equivalent of search(some_node_selector, some_name_space).first. It's all documented in Nokogiri::XML::Node.at. In other words, it finds only the first matching node and returns it, whereas search finds all nodes that match and returns them as a NodeSet.

at accepts a CSS or XPath selector equally. The CSS-specific version is at_css, and the XPath-specific version is at_xpath. I tend to use at unless the selector is ambiguous enough to fool Nokogiri into treating it as the wrong kind.

Similarly, search accepts both CSS and XPath, and css and xpath are the CSS and XPath variants respectively.

%Q!...! is another way of defining an interpreted (double-quoted) string. Besides %Q there are %q and a bare %, along with %w and %W for arrays of words, %s for symbols, %r for regular expressions, %x to execute a command-line application in a sub-shell, and %i for arrays of symbols, which was added in Ruby 2.0.

Here are a bunch of examples:

foo = 'bar'

%Q[a b]        # => "a b"
%Q^a #{ foo }^ # => "a bar"

%[a b]        # => "a b"
%/a #{ foo }/ # => "a bar"

%q#a b#        # => "a b"
%q[a #{ foo }] # => "a \#{ foo }"

%w$a b$ # => ["a", "b"]
%W~a b~ # => ["a", "b"]

%W[a foo]      # => ["a", "foo"]
%W[a #{ foo }] # => ["a", "bar"]

%r.^foo. # => /^foo/
%r!^foo! # => /^foo/
%r/^foo/ # => /^foo/
%x(date) # => "Mon Dec  2 21:13:37 MST 2013\n"

%s[a]   # => :a
%s[a b] # => :"a b"
%i[a b] # => [:a, :b]

Notice that the delimiters can be matching pairs, like () or [], or the same character at both ends, like # or !. This gives a lot of flexibility when dealing with strings containing both single and double quotes and makes it possible to clean up "leaning-toothpick syndrome" lines:

"He's quoting Shakespeare's \"The Taming of the Shrew\"" # => "He's quoting Shakespeare's \"The Taming of the Shrew\""
'He\'s quoting Shakespeare\'s "The Taming of the Shrew"' # => "He's quoting Shakespeare's \"The Taming of the Shrew\""
%Q[He's quoting Shakespeare's "The Taming of the Shrew"] # => "He's quoting Shakespeare's \"The Taming of the Shrew\""

Notice how the last one is visually a lot cleaner, and a lot easier to type in. Those are just simple examples of embedded single and double quotes. Read through Wikipedia's article on "Leaning Toothpick Syndrome" for more examples and information.
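
The same applies to XPath strings, which often mix both kinds of quote. The selector below is a hypothetical combination of the id test from the question and the alt test from the accepted answer, used here only to show the quoting:

```ruby
# Three spellings of the same selector; only the Ruby quoting differs.
double_quoted = "//*[@id=\"bodyblock\"]//a[img/@alt='Next Page']"
single_quoted = '//*[@id="bodyblock"]//a[img/@alt=\'Next Page\']'
percent_q     = %q{//*[@id="bodyblock"]//a[img/@alt='Next Page']}

# %Q still interpolates, so parts of the selector can come from variables.
alt_text     = 'Next Page'
interpolated = %Q{//*[@id="bodyblock"]//a[img/@alt='#{ alt_text }']}

puts percent_q
puts double_quoted == percent_q # => true
puts single_quoted == percent_q # => true
puts interpolated == percent_q  # => true
```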

Not sure about the extra ASCII character (I probably just made a typo when submitting). I'm a bit stumped as to exactly where the changes you suggested should go in the program. I understand "what" is happening as far as the each_with as well as assigning the variable to the index but I'm not sure where all of this should be nested. Also this "page.at(%Q!" selector syntax is new to me and I haven't seen it referenced in any of my readings; I would like to read up on the logic some more but IDK where. Thanks so much for your patience and for sharing expertise. — jcuwaz, Dec 01 '13 at 04:26
