Suppose that I want to check a webpage for the presence of an arbitrarily large number of keywords. How would I go about doing that?
I've tested the XPath selector if response.xpath('//*[text()[contains(.,"red") or contains(.,"blue") or contains(.,"green")]]'):
and it works as expected. The actual set of keywords that I'm interested in checking for is too large to conveniently enter by hand, as above. What I'm interested in is a way to automate that process by generating my selector based on the contents of a file filled with key words.
Starting from a text file with each keyword on its own line, how could I open that file and use it to check whether the keywords it contains appear in the text elements of a given xpath?
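One way to sketch this (assuming a file named keys.txt, and that no keyword contains a double quote, since XPath 1.0 has no string escaping) is to build the predicate programmatically:

```python
def build_keyword_xpath(path):
    """Read one keyword per line and build an XPath matching text containing any of them."""
    with open(path) as f:
        keywords = [line.strip() for line in f if line.strip()]
    # Join each keyword into a contains() test, exactly like the hand-written version.
    tests = ' or '.join('contains(.,"%s")' % word for word in keywords)
    return '//*[text()[%s]]' % tests
```

For a file containing red, blue, and green, this produces the same selector as the manual version above, so it can be passed straight to response.xpath().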
I used the threads Xpath contains value A or value B and XPATH Multiple Element Filters to come up with my manual entry solution, but haven't found anything that addresses automation.
Clarification
I'm not interested in just checking to see whether a given xpath contains any of the keywords provided in my list. I also want to use their presence as a precondition for scraping content from the page. The manual system that I've tested works as follows:
item_info = ItemLoader(item=info_categories(), response=response)
if response.xpath('//*[text()[contains(.,"red") or contains(.,"blue") or contains(.,"green")]]'):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
    return item_info.load_item()
While @alecxe's solution allows me to check the text of a page against a keyword set, switching from 'print' to 'if' and attempting to control the information I extract returns SyntaxError: invalid syntax. Can I combine the convenience of reading in keywords from a list with the function of manually entering them?
Update: exploring Frederic Bazin's regex solution
Over the past few days I've been working with a regex approach to limiting my parse. My code, which adopts Frederic's proposal with a few modifications to account for errors, is as follows:
item_info = ItemLoader(item=info_categories(), response=response)
keywords = '|'.join(re.escape(word.strip()) for word in open('keys.txt'))
r = re.compile('.*(%s).*' % keywords, re.MULTILINE|re.UNICODE)
if r.match(response.body_as_unicode()):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
    return item_info.load_item()
This code runs without errors, but Scrapy reports 0 items crawled and 0 items scraped, so something is clearly going wrong.
I've attempted to debug by running this from the Scrapy shell. My results there suggest that the keywords and r steps are both behaving. If I define and call keywords using the method above for a .txt file containing the words red, blue, and green, I receive 'red|blue|green' in response. Defining and calling r as above gives me <_sre.SRE_Pattern object at 0x17bc980>, which I believe is the expected response. When I run r.match(response.body_as_unicode()), however, I receive no response, even on pages that I know contain one or more of my keywords.
Does anyone have thoughts as to what I'm missing here? As I understand it, whenever one of my keywords appears in the response.body, a match should be triggered and Scrapy should proceed to extract information from that response using the xpaths I've defined. Clearly I'm mistaken, but I'm not sure how or why.
Solution?
I think I may have this problem figured out at last. My current conclusion is that the difficulty was caused by performing r.match on response.body_as_unicode(). The documentation provided here says of match:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
That behaviour was not appropriate to my situation. I'm interested in identifying and scraping information from pages that contain my keywords anywhere within them, not only those that feature one of my keywords at the very start of the page. To accomplish that task, I needed re.search, which scans through a string until it finds a match for the regex pattern generated by compile and returns a MatchObject, or else returns None when there is no match for the pattern.
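The difference is easy to reproduce standalone, outside Scrapy:

```python
import re

text = "The page mentions blue widgets."
pattern = re.compile(r"red|blue|green")

print(pattern.match(text))   # None -- the string does not *start* with a keyword
print(pattern.search(text))  # a match object -- "blue" is found mid-string
```

This is exactly the failure mode above: match returns None on any page whose body does not begin with a keyword, so the if block never runs.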
My current (working!) code follows below. Note that in addition to the switch from match to search, I've added a little bit to my definition of keywords to limit matches to whole words.
item_info = ItemLoader(item=info_categories(), response=response)
keywords = '|'.join(r"\b" + re.escape(word.strip()) + r"\b" for word in open('keys.txt'))
r = re.compile('.*(%s).*' % keywords, re.MULTILINE|re.UNICODE)
if r.search(response.body_as_unicode()):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
    return item_info.load_item()
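As a quick sanity check on the whole-word restriction (standalone, with a hardcoded keyword list rather than keys.txt), the \b anchors distinguish whole words from substrings:

```python
import re

# Same pattern construction as above, minus the file read.
keywords = '|'.join(r"\b" + re.escape(word) + r"\b" for word in ["red", "blue", "green"])
r = re.compile(keywords, re.UNICODE)

print(bool(r.search("a bluebird flew by")))   # False: "blue" appears only inside another word
print(bool(r.search("a blue bird flew by")))  # True: whole-word match
```

Note that the leading and trailing .* in the compiled pattern above are harmless with search, which already scans the whole string, though they are no longer doing any work.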