I'm wondering what would be the most efficient way to check whether text scraped with Scrapy contains a word from a predefined list. It's important to note that the list could hold around ~200 words and the text could come from hundreds of websites, so efficiency matters.

My current solution, with only a couple of words in the list, would be:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BookSpider(CrawlSpider):
    name = 'book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    rules = (
        # note: avoid naming the callback "parse" - CrawlSpider uses parse internally
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        restricted = ['word', 'word1', 'word2']
        # join the text nodes into one string so "in" does a substring check
        text = " ".join(response.xpath("//body//text()").getall())

        for word in restricted:
            if word in text:
                print('Found a restricted word!')
            else:
                print('All good!')

What do you think of such a solution? Maybe there is a more efficient way of achieving the goal?

Teymour

1 Answer

For a "pure" in/not in check, use set.intersection. Creating a set of the bigger text (if you can hold it in memory) will speed up this task tremendously.

A set reduces the words to be checked to the unique ones, and each membership check is O(1), which is about as fast as you can get.
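For illustration, the check itself boils down to this (toy data, not from your spider):

restricted = {"word", "word1", "word2"}
page_words = set("some scraped text mentioning word1 somewhere".split())

hits = restricted.intersection(page_words)  # here: {"word1"}
if hits:
    print("Found restricted words:", hits)

To see the difference on a realistic amount of text, first grab some test data: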

from urllib.request import urlopen

# use the copy on disk if present, else fetch it once from the url and save it
try:
    with open("faust.txt", encoding="utf-8") as f:
        data = f.read()
except FileNotFoundError:
    # partial credit: https://stackoverflow.com/a/46124819/7505395

    # get some freebie text - Goethe's Faust should suffice
    url = "https://archive.org/stream/fausttragedy00goetuoft/fausttragedy00goetuoft_djvu.txt"
    data = urlopen(url).read().decode("utf-8")  # decode the bytes so we work with str
    with open("faust.txt", "w", encoding="utf-8") as f:
        f.write(data)

Process the data for measurements:

words = data.split()  # words: 202915
unique = set(words)   # distinct words: 34809

none_true = {"NoWayThatsInIt_1", "NoWayThatsInIt_2", "NoWayThatsInIt_3", "NoWayThatsInIt_4"}
one_true = none_true | {"foul"}

import time

# crude timing helper - timeit (shown below) would be cleaner
def sloppy_time_measure(f, text):
    print(text, end="")
    t = time.time()
    # execute the function 1000 times
    for _ in range(1000):
        f()
    print((time.time() - t) * 1000, "milliseconds")
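For reference, the equivalent measurement with the standard library's timeit, reusing the sets defined above, would be:

import timeit

# time 1000 runs of the set-based check
ms = timeit.timeit(lambda: none_true.intersection(unique), number=1000) * 1000
print(ms, "milliseconds")

The sloppy helper is good enough to show the orders of magnitude, though: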

# .intersection calculates _full_ intersection, not only an "in" check:
lw = len(words)
ls = len(unique)
sloppy_time_measure(lambda: none_true.intersection(words), f"Find none in list of {lw} words: ")
sloppy_time_measure(lambda: one_true.intersection(words), f"Find one  in list of {lw} words: ")
sloppy_time_measure(lambda: any(w in words for w in none_true), 
                            f"Find none using 'in' in list of {lw} words: ")

sloppy_time_measure(lambda: none_true.intersection(unique), f"Find none in set of {ls} uniques: ")
sloppy_time_measure(lambda: one_true.intersection(unique), f"Find one  in set of {ls} uniques: ")
sloppy_time_measure(lambda: any(w in unique for w in one_true), 
                            f"Find one  using 'in' in set of {ls} uniques: ")

Outputs for 1000 applications of the search (added spacing for clarity):

# in list
Find none in list of 202921 words:             5038.942813873291 milliseconds
Find one  in list of 202921 words:             4234.968662261963 milliseconds
Find none using 'in' in list of 202921 words:  9726.848363876343 milliseconds

# in set
Find none in set of 34809 uniques:               15.897989273071289 milliseconds
Find one  in set of 34809 uniques:               11.409759521484375 milliseconds
Find one  using 'in' in set of 34809 uniques:    39.183855056762695 milliseconds
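Applied to your spider, a minimal sketch (the whitespace split is an assumption - a tokenizer that strips punctuation may fit your data better):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

RESTRICTED = frozenset({"word", "word1", "word2"})  # build the set once


class BookSpider(CrawlSpider):
    name = 'book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # split the page text into words and intersect with the restricted set
        page_words = set(" ".join(response.xpath("//body//text()").getall()).split())
        hits = RESTRICTED & page_words
        if hits:
            self.logger.info("Found restricted words: %s", sorted(hits))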
Patrick Artner