The answer is :
doc.xpath("//a[starts-with(text(), 'some')]")
This is the simplest. Usually the simplest is the fast and best.
Suppose we have the following xml and we read it to doc.
from lxml import etree
s="""
<html>
<head><title>Page Title</title></head>
<body>
<a href="www.example.com">some text</a>
<a href="www.example.com">some text2</a>
<a href="www.example.com">ends with some text2</a>
<a href="www.example.com">other text1</a>
<a href="www.example.com">other text2</a>
</body>
</html>
"""
doc=etree.fromstring(s)
We than test the speed of the three ways mentioned in previous answers.
time |
statement |
39.8 µs |
doc.xpath("//a[re:match(text(), '^some')]", namespaces={'re': 'http://exslt.org/regular-expressions'}) |
29.3 µs |
doc.xpath("//a[re:test(text(), '^some')]", namespaces={'re': 'http://exslt.org/regular-expressions'}) |
16.7 µs |
doc.xpath("//a[starts-with(text(), 'some')]") |
According to the official website here, re:match return an object while re:test only returns a boolean. My guess is re:match must be more complicate than re:test. And when the return value is an object instead of a boolean, more space/memory is needed and so it takes more time to allocate the memory. That is why re:test is faster than re:match. So I am thinking if you just want to check whether a string match a pattern, re:test is enough. Another regular expression function is replace. If you are like me who uses xpath massively in work you should read the document thoroughly likewise. This answers the title of this question, how to use regular expression in lxml xpath.
But regular expression should only be used when simple string functions cannot solve the problem. In your specific case, all that you need is the starts-with function. The time complicity is only O(n), n is the length of the second string. While using regular expression, the algorithm is more complicated. Thus more time is spent.
More about this topic:
from xpath 2.0, regular expression will be available without using exslt. But lxml only support xpath 1.0.
here is the w3 website:
https://www.w3.org/TR/xpath-functions/#string.match