XPath to find all links with just numbers in them?

Question

I've selected the element on a page that contains the pagination links I want. They look like <a href="blah">1</a>. I want to use a regex with XPath so that I can get all the links like that one, whose text matches \d+.

I see there is an answer for it here: "How to use regular expression in lxml xpath?", but I can't make sense of it.

More specifically, I don't understand this part: "Note that you need to give the namespace mapping, so that it knows what the "re" prefix in the xpath expression stands for."

Here's the code from the page, cleaned up:

<div class="pagination">
  <b>1</b>
  <a href="?page=post&s=list&tags=tag2+tag1&pid=25">2</a>
  <a href="?page=post&s=list&tags=tag2+tag1&pid=50">3</a>
  <a href="?page=post&s=list&tags=tag2+tag1&pid=75">4</a>
  <a href="?page=post&s=list&tags=tag2+tag1&pid=100">5</a>
  <a href="?page=post&s=list&tags=tag2+tag1&pid=125">6</a>
  <a href="?page=post&s=list&tags=tag2+tag1&pid=150">7</a>
  <a href="?page=post&s=list&tags=tag2+tag1&pid=175">8</a>
  <a href="?page=post&s=list&tags=tag2+tag1&pid=200">9</a>
  <a href="?page=post&s=list&tags=tag2+tag1&pid=225">10</a>
  <a href="?page=post&s=list&tags=tag2+tag1&pid=250">11</a>
  <a href="?page=post&s=list&tags=tag2+tag1&pid=25" alt="next">›</a>
  <a href="?page=post&s=list&tags=tag2+tag1&pid=325" alt="last page">»</a>
  <br><br><br><br>
  <iframe hspace="0" vspace="0" border="0" marginheight="0" marginwidth="0" allowtransparency="true" src="http://notrelevant.com" frameborder="0" height="98" scrolling="no" width="736"></iframe>
</div>

My code so far:

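import lxml.html

answer = browser.open(address)  # browser and address are created earlier (not shown)
tree = lxml.html.parse(answer)
# The pagination <div> that holds the page links
numbers = tree.xpath("//div[contains(@class, 'pagination')]")[0]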

score 2 · Answer 1 · answered Apr 18 '11 at 01:27

You don't need a regexp for this. This XPath expression is enough:

//div[
   contains(
      concat(' ',@class,' '),
      ' pagination '
   )
]/a[
  floor(.)=.
]
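
For reference, here is a minimal sketch of running that expression from lxml. The markup is a trimmed, simplified stand-in for the page in the question (the short href values and the HTML string name are made up for illustration), and I'm assuming the document is parsed with lxml.html as in the question's code.

```python
import lxml.html

# Trimmed, simplified stand-in for the pagination markup in the question.
# The "next"/"last page" arrows are written as HTML entities here.
HTML = """
<div class="pagination"> <b>1</b>
<a href="?pid=25">2</a><a href="?pid=50">3</a>
<a href="?pid=25" alt="next">&rsaquo;</a><a href="?pid=325" alt="last page">&raquo;</a>
</div>
"""

root = lxml.html.fromstring(HTML)

# floor(.) = . is true only when the link text converts to a whole number.
# Non-numeric text converts to NaN, and NaN never compares equal to anything.
links = root.xpath(
    "//div[contains(concat(' ', @class, ' '), ' pagination ')]/a[floor(.) = .]"
)

page_numbers = [a.text for a in links]
print(page_numbers)
```

Only the links labelled 2 and 3 should come back; the two arrow links are filtered out because their text is not a number.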

score 2 · Accepted Answer · answered Apr 18 '11 at 01:28

XPath does not provide a means to match a regexp.

The extension used in the post to which you link should allow the following to work, though:

//div[contains(@class, 'pagination')]/a[re:match(text(), '^\d+$')]
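
As a rough sketch of what the namespace mapping looks like in practice: lxml implements the EXSLT regular-expressions functions, and the namespaces argument of xpath() is what binds the "re" prefix to them. Two assumptions here: the markup is a trimmed stand-in for the page in the question (the short hrefs are made up), and I've used re:test, the EXSLT function that returns a boolean, in place of re:match.

```python
import lxml.html

# EXSLT regular-expressions namespace; lxml provides its functions in XPath.
REGEXP_NS = "http://exslt.org/regular-expressions"

# Trimmed, simplified stand-in for the pagination markup in the question.
HTML = """
<div class="pagination"> <b>1</b>
<a href="?pid=25">2</a><a href="?pid=50">3</a>
<a href="?pid=25" alt="next">&rsaquo;</a><a href="?pid=325" alt="last page">&raquo;</a>
</div>
"""

root = lxml.html.fromstring(HTML)

# The namespaces mapping tells lxml what the "re" prefix stands for.
# A raw string keeps the backslash in \d intact for the regex engine.
links = root.xpath(
    r"//div[contains(@class, 'pagination')]/a[re:test(text(), '^\d+$')]",
    namespaces={"re": REGEXP_NS},
)

hrefs = [a.get("href") for a in links]
print(hrefs)
```

The prefix itself is arbitrary: any name works as long as the key in the namespaces dict matches the prefix used in the expression.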

ikegami

You wrote _"XPath does not provide a means to match a regexp"_. **That's wrong**. This is the latest XPath specification: http://www.w3.org/TR/xpath20/ – Apr 18 '11 at 13:07
