How to get <p> that contains text which matches regex

Question

I am trying to scrape this website using scrapy, xpath and regex. I have checked and tried the answers to this question: xpath+ regex: matches text

I want to create a 'scrapy.selector.unified.SelectorList' of <p> that contain the text "11 (sun)" or "9 (fri)" and such, and loop through the list.

event = response.xpath('//p[matches(text(), "\d+\s\(\w{3}\)")]').extract()

does not work.

FYI, below does work.

event = response.xpath('//p[contains(text(), "11 (sun)")]').extract()

What am I missing here?

Thanks, but it is not working for me; I get the error below. ValueError: XPath error: Unregistered function in //p[matches(text(), ".*\d+\s\([a-zA-Z]{3}\).*")] — deekay, Nov 21 '18 at 09:42
See [this thread](https://stackoverflow.com/questions/34047567/xpathevalerror-unregistered-function-for-matches-in-lxml), it may help. — Wiktor Stribiżew, Nov 21 '18 at 09:47

score 1 · Answer 1 · answered Nov 21 '18 at 10:18

You can use re() instead of extract(). The .re() method is called on each element in the list, and the results are returned flattened, as a list of unicode strings. Because .re() returns strings rather than selectors, you can't construct nested .re() calls.

event = response.xpath('//p/text()').re(r"\d+\s\(\w{3}\)")

Note: re() decodes HTML entities (except &lt; and &amp;).

For more information, please refer to the documentation: https://doc.scrapy.org/en/latest/topics/selectors.html#scrapy.selector.SelectorList.re

Thanks for the input, but like stranac mentioned, I am after the elements 'scrapy.selector.unified.SelectorList'. I've modified my question. — deekay, Nov 23 '18 at 08:21

score 1 · Accepted Answer · answered Nov 21 '18 at 16:35

If you're only after text, Karan Verma's answer is sufficient.
If you're after the elements themselves, keep reading.

matches is only available in XPath 2.0 and higher (as are the other regex functions), and is not available in scrapy.

Scrapy uses parsel for parsing, which in turn uses lxml, which only supports XPath 1.0.
It does, however, support regular expressions in the EXSLT namespace.
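To make the difference concrete, the sketch below uses lxml on its own: the XPath 2.0 matches() call is rejected, while the EXSLT re:test() function works once the namespace is passed in explicitly. The sample markup and variable names are assumptions for illustration, and re:test() (which returns a boolean) is used here as the predicate function.

```python
from lxml import etree, html

# Hypothetical sample markup standing in for the scraped page.
snippet = (
    "<div><p>9 (fri) Opening night</p>"
    "<p>11 (sun) Matinee</p>"
    "<p>No date here</p></div>"
)
doc = html.fromstring(snippet)

pattern = r"\d+\s\(\w{3}\)"
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# XPath 2.0's matches() is unknown to lxml's XPath 1.0 engine.
error = None
try:
    doc.xpath(f'//p[matches(text(), "{pattern}")]')
except etree.XPathEvalError as exc:
    error = str(exc)
print(error)

# The EXSLT regex function works when its namespace is supplied.
matched = doc.xpath(f'//p[re:test(text(), "{pattern}")]', namespaces=EXSLT_RE)
print([p.text for p in matched])
```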

Since the regex namespace is enabled by default in scrapy, you can do this:

event = response.xpath(r'//p[re:match(text(), "\d+\s\(\w{3}\)")]')

answered by stranac

Thank you stranac for the answer. This seems to be the answer I was looking for, but it returns an empty list. The regex does not seem to match the text I'm targeting. If I use ".*" it returns all potential <p>. Any advice on the regex to grab 11 (sun), 12 (mon), 13 (tue) and such? Thanks in advance.
– deekay Nov 23 '18 at 08:24
Apologies, I had the wrong scrapy shell URL: I forgot to include $(date +%Y%m) to get the YYYYMM string in the path. It worked just fine. Thanks for the great answer. – deekay Dec 04 '18 at 09:05
