4

Suppose that I want to check a webpage for the presence of an arbitrarily large number of keywords. How would I go about doing that?

I've tested the XPath selector if response.xpath('//*[text()[contains(.,"red") or contains(.,"blue") or contains(.,"green")]]'): and it works as expected. The actual set of keywords that I'm interested in checking for is too large to enter conveniently by hand, as above. What I'm interested in is a way to automate that process by generating my selector from the contents of a file of keywords.

Starting from a text file with each keyword on its own line, how could I open that file and use it to check whether the keywords it contains appear in the text elements of a given xpath?

I used the threads Xpath contains value A or value B and XPATH Multiple Element Filters to come up with my manual entry solution, but haven't found anything that addresses automation.
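To make the question concrete, here is a rough sketch of the kind of generation I have in mind (the file name keys.txt and its contents are just examples; the file is written in the snippet only to keep it self-contained):

```python
# Rough sketch: build the or-chain from a keyword file instead of typing it.
# keys.txt is created here only so the example runs on its own.
with open('keys.txt', 'w') as f:
    f.write('red\nblue\ngreen\n')

with open('keys.txt') as f:
    keywords = [line.strip() for line in f if line.strip()]

# Join one contains() test per keyword into a single predicate.
conditions = ' or '.join('contains(.,"%s")' % kw for kw in keywords)
selector = '//*[text()[%s]]' % conditions
print(selector)
# //*[text()[contains(.,"red") or contains(.,"blue") or contains(.,"green")]]
```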

Clarification

I'm not interested in just checking to see whether a given xpath contains any of the keywords provided in my list. I also want to use their presence as a precondition for scraping content from the page. The manual system that I've tested works as follows:

item_info = ItemLoader(item=info_categories(), response=response)
if response.xpath('//*[text()[contains(.,"red") or contains(.,"blue") or contains(.,"green")]]'):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
return item_info.load_item()

While @alecxe's solution lets me check the text of a page against a keyword set, switching from print to if and attempting to control the information I extract returns SyntaxError: invalid syntax. Can I combine the convenience of reading keywords in from a file with the functionality of entering them manually?

Update—exploring Frederic Bazin's regex solution

Over the past few days I've been working with a regex approach to limiting my parse. My code, which adopts Frederic's proposal with a few modifications to account for errors, is as follows:

item_info = ItemLoader(item=info_categories(), response=response)
keywords = '|'.join(re.escape(word.strip()) for word in open('keys.txt'))
r = re.compile('.*(%s).*' % keywords, re.MULTILINE|re.UNICODE)
if r.match(response.body_as_unicode()):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
return item_info.load_item()

This code runs without errors, but Scrapy reports 0 items crawled and 0 items scraped, so something is clearly going wrong.

I've attempted to debug by running this from the Scrapy shell. My results there suggest that the keywords and r steps are both behaving. If I define and call keywords using the method above for a .txt file containing the words red, blue, and green, I receive in response 'red|blue|green'. Defining and calling r as above gives me <_sre.SRE_Pattern object at 0x17bc980>, which I believe is the expected response. When I run r.match(response.body_as_unicode()), however, I receive no response, even on pages that I know contain one or more of my keywords.

Does anyone have thoughts as to what I'm missing here? As I understand it, whenever one of my keywords appears in the response.body, a match should be triggered and Scrapy should proceed to extract information from that response using the xpaths I've defined. Clearly I'm mistaken, but I'm not sure how or why.

Solution?

I think I may have this problem figured out at last. My current conclusion is that the difficulty was caused by performing r.match on the response.body_as_unicode. The documentation provided here says of match:

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

That behaviour was not appropriate to my situation. I'm interested in identifying and scraping information from pages that contain my keywords anywhere within them, not just those that feature one of my keywords as the first item on the page. To accomplish that, I needed re.search, which scans through a string until it finds a match for the regex pattern generated by compile and returns a MatchObject, or returns None if the string contains no match for the pattern.
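A minimal illustration of the difference, with the keyword pattern hard-coded for brevity:

```python
import re

r = re.compile('.*(red|blue|green).*', re.MULTILINE | re.UNICODE)
body = u'first line with nothing of interest\nsecond line mentions blue'

# match() only tries to match at the start of the string, so the keyword
# on the second line is never seen; search() scans the whole string.
print(r.match(body))             # None
print(r.search(body) is not None)  # True
```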

My current (working!) code follows below. Note that in addition to the switch from match to search I've added a little bit to my definition of keywords to limit matches to whole words.

item_info = ItemLoader(item=info_categories(), response=response)
keywords = '|'.join(r"\b" + re.escape(word.strip()) + r"\b" for word in open('keys.txt'))
r = re.compile('.*(%s).*' % keywords, re.MULTILINE|re.UNICODE)
if r.search(response.body_as_unicode()):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
return item_info.load_item()
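The effect of the added word boundaries can be checked in isolation (sample keyword list stands in for reading keys.txt):

```python
import re

# Rebuild the word-boundary pattern from a sample list rather than a file.
words = ['red', 'blue', 'green']
keywords = '|'.join(r"\b" + re.escape(w) + r"\b" for w in words)
r = re.compile('.*(%s).*' % keywords, re.MULTILINE | re.UNICODE)

# "red" inside "infrared" is not a whole word, so it does not match;
# standalone "red" does.
print(bool(r.search(u'the infrared image')))    # False
print(bool(r.search(u'the sky is red today')))  # True
```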

2 Answers


Regex is probably the fastest way to run the test on a large number of pages:

import re
keywords = '|'.join(re.escape(word.strip()) for word in open('keywords.txt'))
r = re.compile('.*(%s).*' % keywords, re.MULTILINE|re.UNICODE)
if r.match(response.body_as_unicode()):

Generating an XPath expression over multiple keywords could work too, but you add the extra CPU load (typically ~100 ms) of parsing the page as XML before running the XPath.

Frederic Bazin
  • Thanks, this looks promising. I'm having a bit of trouble getting your solution to run so that I can test it, however. The line where you define keywords returns the error `TypeError: "'builtin_function_or_method' object is not iterable"`, which I gather from this post [link](http://stackoverflow.com/questions/30145926/main-loop-builtin-function-or-method-object-is-not-iterable) means that a method is being referenced without being called. Unfortunately, I'm having trouble seeing where. – Tric Aug 10 '15 at 01:42
  • I've been experimenting a bit since I hit the error last night and changing to `word.strip()` and `response.body_as_unicode()` respectively deals with the error I mentioned above and a `TypeError: "expected string or buffer"` that's triggered by using `word.strip()` alone. Unfortunately, those changes also seem to break the parse method that I tested previously. The debug responses indicate that it crawls all the responses I expect, but it now returns `Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)`. – Tric Aug 10 '15 at 19:24
  • I fixed the bugs based on your feedback; it seems you resolved it earlier anyway. I hope you were able to measure a significant performance improvement with this method? – Frederic Bazin Aug 23 '15 at 15:29

You can also check if a keyword is inside the response.body:

source = response.body
with open('input.txt') as f:
    for word in f:
        print word, word.strip() in source

Or, using any():

with open('input.txt') as f:
    print any(word.strip() in source for word in f)
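For example, with a stand-in source string and keyword file (both are placeholders for the real response.body and input.txt):

```python
# Stand-ins for response.body and the real input.txt, to keep this runnable.
source = '<html><body>The sky is blue today.</body></html>'

with open('input.txt', 'w') as f:
    f.write('red\nblue\ngreen\n')

# any() short-circuits on the first keyword found in the source.
with open('input.txt') as f:
    found = any(word.strip() in source for word in f)

print(found)  # True, because "blue" occurs in source
```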
alecxe
  • Thanks for your response! The open file -> read word in file approach is much less circuitous than what I was imagining having to do. However, my wording in the initial question was imprecise; I don't just want to check for the presence of a keyword set, I want to use the presence of at least one of its members as the condition for my parse. This (broken) code might give you a clearer picture of what I'm aiming for: `with open('keys.txt') as keyword_list: if response.xpath('//*[text()[contains(., word in keyword_list)]]'):` – Tric Aug 09 '15 at 03:49
  • Thanks for the update. It runs and tells me whether one of the words in my list is matched in the response.body; that's great, but it's not quite what I'm looking for. I'd like to impose the presence of a keyword as a condition for scraping data in the first place. I'll update my original question to clarify this point and provide a bit of context. On a side note, why define `source` instead of writing `print any(word.strip() in response.body for word in f)`? – Tric Aug 09 '15 at 05:57