
While building a scraper, I ran into a situation where I have too many keywords to hard-code. So I wanted to implement a regular expression that reads the keywords from a "dictionary" file; when the crawler/scraper matches one of them on a website, it should scrape the whole paragraph containing that keyword.

A single-keyword version of the code looks like this:

for Keyword in response.xpath('//*'):
    yield {
        'dictA': Keyword.xpath('//p/text()[contains(..,"Specific Keyword/s")]').extract(),
    }

This gets me the whole paragraph that contains "Specific Keyword/s". But I have around 100 words, and I don't want to write:

dictA1
.
.
.
dictA100

That is inefficient. How could I go about this? As always, hints and pointers are welcome.

Schneejäger

1 Answer

If you want to process a list of keywords and check each one against an XPath expression, you can use this:

for Keyword in response.xpath('//*'):
    for specific_keyword in keyword_list:
        yield {
            'dict': Keyword.xpath('//p/text()[contains(.,"{0}")]'.format(specific_keyword)).extract(),
        }
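
One caveat with building the expression through `.format()`: a keyword that itself contains a double quote would produce an invalid XPath. Below is a minimal sketch of a quoting helper, not something Scrapy provides; the names `xpath_literal` and `paragraph_xpath` are hypothetical and only for illustration.

```python
def xpath_literal(text):
    """Return text as an XPath 1.0 string literal (hypothetical helper)."""
    if '"' not in text:
        return '"{0}"'.format(text)
    if "'" not in text:
        return "'{0}'".format(text)
    # Both quote types are present. XPath 1.0 has no escape sequences,
    # so split on the double quotes and glue the pieces with concat().
    parts = text.split('"')
    return "concat(" + ", '\"', ".join('"{0}"'.format(p) for p in parts) + ")"


def paragraph_xpath(keyword):
    """Build the same expression as above for a single keyword."""
    return "//p/text()[contains(., {0})]".format(xpath_literal(keyword))


print(paragraph_xpath("Pasta"))
```

The resulting string can be passed to `response.xpath(...)` in place of the `.format()` call.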

UPDATE: After some clarification from you:

for word in keyword_list:
    for para_text in response.xpath('//p/text()[contains(..,"{0}")]'.format(word)).extract():        
        yield {
            'dict':  para_text,             
        }
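
The `keyword_list` used here has to be filled before the loop runs. A minimal sketch of one way to do that, assuming a plain text or single-column CSV file with one keyword per line; the helper name `load_keywords` is hypothetical. Stripping each line matters because iterating over a file keeps the trailing newline, which would otherwise end up inside `contains(...)`.

```python
import io


def load_keywords(lines):
    """Return stripped, non-empty, de-duplicated keywords from an iterable of lines."""
    seen = set()
    keywords = []
    for line in lines:
        word = line.strip()  # drop the trailing newline and stray spaces
        if word and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


# Small in-memory demo; a real spider would pass an open file instead.
demo = io.StringIO("Pasta\n\nPizza \nPasta\n")
print(load_keywords(demo))
```

Inside the spider this could be called as `keyword_list = load_keywords(open_file)`, where `open_file` is the opened keyword file; the file's encoding is an assumption you would need to check for your own data.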
gangabass
  • So, help me understand. The specific_keyword would be a word from the file (.csv if it helps)? Because from what I get it would translate as `for word_in_file in file: yield text that contains word_in_file`? If that is the case, then I need a way to make this with 100 keywords, not just a single one. But I may have misunderstood. – Schneejäger May 22 '18 at 13:00
  • @schneejäger first you need to read your keywords (from a CSV file, database or something else) into `keywords_list` – gangabass May 22 '18 at 13:03
  • By that you mean to `keyword_list: open("file.csv", "rt")`, yes? – Schneejäger May 22 '18 at 13:06
  • like this: https://stackoverflow.com/questions/3277503/in-python-how-do-i-read-a-file-line-by-line-into-a-list?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa – gangabass May 22 '18 at 13:07
  • It throws an error `'str' object has no attribute 'xpath' ` because there isn't a response.xpath to guide it. Maybe I'm applying it wrong to my example. – Schneejäger May 22 '18 at 13:21
  • I think you're trying to apply `.xpath()` to your keyword... Please show your current code – gangabass May 22 '18 at 13:22
  • ` def parse(self, response): in_file = tuple(open("dictionaryA.csv", "r")) for link in response.xpath('/html/body/div[1]/div/main/article/div/div[2]'): yield { 'dictA1': link.xpath('//p/text()[contains(..,"Pasta")]').extract(), }` This is the one working. `for word in in_file: yield { 'dictA': word.xpath('//p/text()[contains(.,"{0}")]'.format(word)).extract(), }` This would be with what you have provided me. – Schneejäger May 22 '18 at 13:26
  • Please show your HTML and explain what you meant by `Keyword in response.xpath('//*'):` – gangabass May 22 '18 at 13:29
  • `link in response.xpath('/html/body/div[1]/div/main/article/div/div[2]'):` this would be the `Keyword in response.xpath('//*'):` you refer to, I changed some names in the original question. – Schneejäger May 22 '18 at 13:31
  • We've made progress, but got a new problem: `ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters`. – Schneejäger May 22 '18 at 14:01
  • I can't code for you. Obviously your last error isn't related to your original question – gangabass May 22 '18 at 14:05
  • Obviously, just wanted to let you know of the advancement. Thank you! – Schneejäger May 22 '18 at 14:09