Python scrapy and Regex check string from file and scrape

Question

While building a scraper, I ended up with more keywords than I can reasonably hard-code. I want to read the keywords from a "dictionary" file, possibly through a regular expression, and whenever the crawler matches one of them on a page, scrape the whole paragraph that contains it.

The code for scraping the paragraphs that match a single keyword looks like this:

for Keyword in response.xpath('//*'):
    yield {
        'dictA': Keyword.xpath('//p/text()[contains(..,"Specific Keyword/s")]').extract(),
    }

This gets me the whole paragraph that contains "Specific Keyword/s". But I have around 100 keywords, and I don't want to write:

dictA1
.
.
.
dictA100

That is inefficient. How could I go about this? As always, hints and pointers are welcome.

gangabass · Accepted Answer · 2018-05-22T13:33:40.083


If you want to process a list of keywords and check each one against an XPath expression, you can use this:

for Keyword in response.xpath('//*'):
    for specific_keyword in keyword_list:
        yield {
            'dict': Keyword.xpath('//p/text()[contains(.,"{0}")]'.format(specific_keyword)).extract(),
        }
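The loop above assumes `keyword_list` already exists. A minimal sketch of building it from a file with one keyword per line follows. The helper name `load_keywords` and the one-keyword-per-line layout are assumptions for illustration, not part of Scrapy. Each line is stripped because lines read from a file keep their trailing newline, which would otherwise become part of the text searched for by `contains()`.

```python
def load_keywords(path):
    """Read one keyword per line, skipping blank lines and duplicates.

    Hypothetical helper: the file layout (one keyword per line) is an
    assumption about how the "dictionary" file is organised.
    """
    keywords = []
    # "utf-8-sig" decodes UTF-8 and drops a leading byte order mark if present.
    with open(path, "r", encoding="utf-8-sig") as handle:
        for line in handle:
            word = line.strip()  # remove the trailing newline and stray spaces
            if word and word not in keywords:
                keywords.append(word)
    return keywords
```

With this in place, `keyword_list = load_keywords("dictionaryA.csv")` could be run once, for example at the top of the parse callback, before the loop starts.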

UPDATE: After some clarifications from you:

for word in keyword_list:
    for para_text in response.xpath('//p/text()[contains(..,"{0}")]'.format(word)).extract():
        yield {
            'dict': para_text,
        }
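Since the question mentions regular expressions, here is a sketch of an alternative that swaps the per-keyword XPath query for a single compiled pattern applied to text that has already been extracted. It is not the method used above: the function names are hypothetical, the matching is a case-sensitive substring match (similar to `contains()`), and each text is yielded once with all keywords found in it, rather than once per keyword.

```python
import re


def build_keyword_pattern(keywords):
    """Compile one regex that matches any of the given keywords literally."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # re.escape keeps characters such as "." or "/" from acting as regex syntax.
    # Longer keywords go first so they win over shorter ones they start with.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile("|".join(escaped))


def paragraphs_with_keywords(paragraphs, pattern):
    """Yield one item per text that contains at least one keyword."""
    for text in paragraphs:
        found = sorted(set(pattern.findall(text)))
        if found:
            yield {"dict": text, "matched": found}
```

In a spider callback, `paragraphs` could be the list returned by an `.extract()` call such as `response.xpath('//p/text()').extract()`, which gives text nodes rather than whole paragraphs. The pattern only needs to be compiled once, however many keywords there are.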

edited May 22 '18 at 13:33

answered May 22 '18 at 12:52


So, help me understand. The specific_keyword would be a word from the file (.csv if it helps)? Because from what I get it would translate as `for word_in_file in file: yield text that contains word_in_file`? If that is the case, then I need a way to make this with 100 keywords, not just a single one. But I may have misunderstood. – Schneejäger May 22 '18 at 13:00
@schneejäger first you need to read your keywords (from a CSV file, a database or something else) into `keyword_list` – gangabass May 22 '18 at 13:03
By that you mean to `keyword_list: open("file.csv", "rt")`, yes? – Schneejäger May 22 '18 at 13:06
like this: https://stackoverflow.com/questions/3277503/in-python-how-do-i-read-a-file-line-by-line-into-a-list?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa – gangabass May 22 '18 at 13:07
It throws an error `'str' object has no attribute 'xpath' ` because there isn't a response.xpath to guide it. Maybe I'm applying it wrong to my example. – Schneejäger May 22 '18 at 13:21
I think you're trying to apply `.xpath()` to your keyword... Please show your current code – gangabass May 22 '18 at 13:22
` def parse(self, response): in_file = tuple(open("dictionaryA.csv", "r")) for link in response.xpath('/html/body/div[1]/div/main/article/div/div[2]'): yield { 'dictA1': link.xpath('//p/text()[contains(..,"Pasta")]').extract(), }` This is the one working. `for word in in_file: yield { 'dictA': word.xpath('//p/text()[contains(.,"{0}")]'.format(word)).extract(), }` This would be with what you have provided me. – Schneejäger May 22 '18 at 13:26
Please show your HTML and explain what you meant by `Keyword in response.xpath('//*'):` – gangabass May 22 '18 at 13:29
`link in response.xpath('/html/body/div[1]/div/main/article/div/div[2]'):` this would be the `Keyword in response.xpath('//*'):` you refer to, I changed some names in the original question. – Schneejäger May 22 '18 at 13:31
We've advanced, got a new problem `ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters`. – Schneejäger May 22 '18 at 14:01
I can't code for you. Obviously your last error isn't related to your original question – gangabass May 22 '18 at 14:05
Obviously, just wanted to let you know of the advancement. Thank you! – Schneejäger May 22 '18 at 14:09
