
I've selected the element on a page that contains the pagination links I want. They look like <a href="blah">1</a>. I want to use a regex with XPath so that I can get all the links like that one, whose text matches \d+.

I see there is an answer for it here: "How to use regular expression in lxml xpath?", but I can't make sense of it.

More specifically, I don't understand this part: "Note that you need to give the namespace mapping, so that it knows what the "re" prefix in the xpath expression stands for."

Here's the code from the page, cleaned up:

<div class="pagination">
<b>1</b>
<a href="?page=post&amp;s=list&amp;tags=tag2+tag1&amp;pid=25">2</a>
<a href="?page=post&amp;s=list&amp;tags=tag2+tag1&amp;pid=50">3</a>
<a href="?page=post&amp;s=list&amp;tags=tag2+tag1&amp;pid=75">4</a>
<a href="?page=post&amp;s=list&amp;tags=tag2+tag1&amp;pid=100">5</a>
<a href="?page=post&amp;s=list&amp;tags=tag2+tag1&amp;pid=125">6</a>
<a href="?page=post&amp;s=list&amp;tags=tag2+tag1&amp;pid=150">7</a>
<a href="?page=post&amp;s=list&amp;tags=tag2+tag1&amp;pid=175">8</a>
<a href="?page=post&amp;s=list&amp;tags=tag2+tag1&amp;pid=200">9</a>
<a href="?page=post&amp;s=list&amp;tags=tag2+tag1&amp;pid=225">10</a>
<a href="?page=post&amp;s=list&amp;tags=tag2+tag1&amp;pid=250">11</a>
<a href="?page=post&amp;s=list&amp;tags=tag2+tag1&amp;pid=25" alt="next">›</a>
<a href="?page=post&amp;s=list&amp;tags=tag2+tag1&amp;pid=325" alt="last page">»</a>
<br><br><br><br>
<iframe hspace="0" vspace="0" border="0" marginheight="0" marginwidth="0" allowtransparency="true" src="http://notrelevant.com" frameborder="0" height="98" scrolling="no" width="736"></iframe>
</div>

My code so far:

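import lxml.html

answer = browser.open(address)  # browser and address are defined elsewhere
tree = lxml.html.parse(answer)
# First <div> whose class attribute contains "pagination"
numbers = tree.xpath("//div[contains(@class, 'pagination')]")[0]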
rdernga

2 Answers


You don't need a regular expression for this. A plain XPath 1.0 expression is enough:

//div[
   contains(
      concat(' ',@class,' '),
      ' pagination '
   )
]/a[
  floor(.) = .
]
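
For what it's worth, here is a minimal sketch of how that expression could be run from lxml. The HTML is a trimmed-down stand-in for the markup in the question, and the names HTML, links and hrefs are only for illustration. Note that the floor test accepts anything XPath can convert to a whole number, so it is not exactly equivalent to \d+ (a text like "-3" would also pass), but for this pagination block that makes no difference.

```python
import lxml.html

# Trimmed-down stand-in for the pagination markup in the question.
HTML = """
<div class="pagination"> <b>1</b>
<a href="?page=post&amp;pid=25">2</a>
<a href="?page=post&amp;pid=50">3</a>
<a href="?page=post&amp;pid=25" alt="next">›</a>
<a href="?page=post&amp;pid=325" alt="last page">»</a>
</div>
"""

tree = lxml.html.fromstring(HTML)

# floor(.) = . holds only when the link text converts to a whole number.
# For "›" and "»" the conversion yields NaN, and NaN never equals anything.
links = tree.xpath(
    "//div[contains(concat(' ', @class, ' '), ' pagination ')]"
    "/a[floor(.) = .]"
)

hrefs = [a.get("href") for a in links]
print(hrefs)
```

Only the numbered links should come back; the "next" and "last page" arrows are filtered out.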
2

XPath does not provide a means to match a regexp.

The extension used in the post to which you link should allow the following to work, though:

//div[contains(@class, 'pagination')]/a[re:match(text(), '^\d+$')]
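
As a rough sketch of what the "namespace mapping" means in practice: lxml's xpath() takes a namespaces dict, and mapping the prefix "re" to the EXSLT regular-expressions URI is what makes the re: functions resolvable. The example below uses re:test, which returns a boolean and so fits a predicate naturally. The HTML is a trimmed-down stand-in for the question's markup, and REGEXP_NS, links and page_numbers are illustrative names only.

```python
import lxml.html

# EXSLT regular-expressions namespace URI; the prefix bound to it
# ("re" here) is arbitrary, it only has to match the XPath expression.
REGEXP_NS = "http://exslt.org/regular-expressions"

# Trimmed-down stand-in for the pagination markup in the question.
HTML = """
<div class="pagination"> <b>1</b>
<a href="?page=post&amp;pid=25">2</a>
<a href="?page=post&amp;pid=50">3</a>
<a href="?page=post&amp;pid=25" alt="next">›</a>
</div>
"""

tree = lxml.html.fromstring(HTML)

# Raw string so that \d reaches the regex engine unchanged.
links = tree.xpath(
    r"//div[contains(@class, 'pagination')]/a[re:test(text(), '^\d+$')]",
    namespaces={"re": REGEXP_NS},
)

page_numbers = [a.text for a in links]
print(page_numbers)
```

Without the namespaces argument, the "re" prefix is undefined and the expression cannot be evaluated.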
ikegami
    You wrote _"XPath does not provide a means to match a regexp"_. **That's wrong**. This is the latest XPath specification: http://www.w3.org/TR/xpath20/ –  Apr 18 '11 at 13:07