How to use regular expression in lxml xpath?

Question

I'm using construction like this:

doc = parse(url).getroot()
links = doc.xpath("//a[text()='some text']")

But I need to select all links which have text beginning with "some text", so I'm wondering is there any way to use regexp here? Didn't find anything in lxml documentation

score 46 · Accepted Answer · answered May 03 '10 at 08:53

46

You can do this (although you don't need regular expressions for the example). Lxml supports regular expressions from the EXSLT extension functions. (see the lxml docs for the XPath class, but it also works for the xpath() method)

doc.xpath("//a[re:match(text(), 'some text')]", 
        namespaces={"re": "http://exslt.org/regular-expressions"})

Note that you need to give the namespace mapping, so that it knows what the "re" prefix in the xpath expression stands for.

answered May 03 '10 at 08:53

Steven

28,002
5
61
51

1

Not working for me, I do: `match(., 'some text')`. By the way I don't quite understand the `.` part. And func `test` has the same result (I think it makes more sense to use `test` actually :P) – lajarre Mar 22 '13 at 20:47
1

[see this](http://stackoverflow.com/a/17293795/786559) if you are tired of passing the namespaces – Ciprian Tomoiagă Feb 28 '17 at 19:39

John Kugelman · Answer 2 · 2020-06-04T20:56:10.083

20

You can use the starts-with() function:

doc.xpath("//a[starts-with(text(),'some text')]")

edited Jun 04 '20 at 20:56

answered May 03 '10 at 03:22

John Kugelman

349,597
67
533
578

score 1 · Answer 3 · answered May 16 '18 at 22:27

1

Because I can't stand lxml's approach to namespaces, I wrote a little method that you cam bind to the HtmlElement class.

Just import HtmlElement:

from lxml.etree import HtmlElement

Then put this in your file:

# Patch the HtmlElement class to add a function that can handle regular
# expressions within XPath queries.
def re_xpath(self, path):
    return self.xpath(path, namespaces={
        're': 'http://exslt.org/regular-expressions'})
HtmlElement.re_xpath = re_xpath

And then when you want to make a regular expression query, just do:

my_node.re_xpath("//a[re:match(text(), 'some text')]")

And you're off to the races. With a little more work, you could probably modify this to replace the xpath method itself, but I haven't bothered since this is working well enough.

answered May 16 '18 at 22:27

mlissner

17,359
18
106
169

1

Import is now `from lxml.html import HtmlElement` – BSalita Jul 27 '20 at 21:17
1

More helpful example with regex searching for `
` `tree.re_xpath("//div[re:match(@id, 'blah_\d+')]")`
– BSalita Jul 27 '20 at 23:15
Can I use something like this to override my_node.find()? I want to insert {*} – Priv Acyplease Mar 25 '21 at 15:58
I don't know why not, @PrivAcyplease. – mlissner Mar 25 '21 at 17:02

score 1 · Answer 4 · answered Jan 08 '21 at 09:06

why don't you just use xpath method starts-with here. you can use this to select specific elements that has text starting with your word something like

doc.xpath("//a[starts-with(text(),'some text')]")

note that if you want to select this element as well

<a href="www.example.com">ends with some text2</a>

its text is not starting with some text but it can be included as well using contains method something like

doc.xpath("//a[contains(text(),'some text')]")

score 0 · Answer 5 · answered Jan 08 '21 at 07:13

The answer is :

doc.xpath("//a[starts-with(text(), 'some')]")

This is the simplest. Usually the simplest is the fast and best.

Suppose we have the following xml and we read it to doc.

from lxml import etree
s="""
<html>
<head><title>Page Title</title></head>
<body>
    <a href="www.example.com">some text</a>
    <a href="www.example.com">some text2</a>
    <a href="www.example.com">ends with some text2</a>
    <a href="www.example.com">other text1</a>
    <a href="www.example.com">other text2</a>
</body>
</html>
"""
doc=etree.fromstring(s)

We than test the speed of the three ways mentioned in previous answers.

time	statement
39.8 µs	doc.xpath("//a[re:match(text(), '^some')]", namespaces={'re': 'http://exslt.org/regular-expressions'})
29.3 µs	doc.xpath("//a[re:test(text(), '^some')]", namespaces={'re': 'http://exslt.org/regular-expressions'})
16.7 µs	doc.xpath("//a[starts-with(text(), 'some')]")

According to the official website here, re:match return an object while re:test only returns a boolean. My guess is re:match must be more complicate than re:test. And when the return value is an object instead of a boolean, more space/memory is needed and so it takes more time to allocate the memory. That is why re:test is faster than re:match. So I am thinking if you just want to check whether a string match a pattern, re:test is enough. Another regular expression function is replace. If you are like me who uses xpath massively in work you should read the document thoroughly likewise. This answers the title of this question, how to use regular expression in lxml xpath.

But regular expression should only be used when simple string functions cannot solve the problem. In your specific case, all that you need is the starts-with function. The time complicity is only O(n), n is the length of the second string. While using regular expression, the algorithm is more complicated. Thus more time is spent.

How to use regular expression in lxml xpath?

5 Answers5

Linked

Related