
Suppose that I need to handle a very big list of words, and I need to count the number of times I find any of those words in a piece of text I have. Which is the best option in terms of scalability?

Option I (regex)

>>> import re
>>> s = re.compile("|".join(map(re.escape, big_list)))  # escape, in case any word contains regex metacharacters
>>> len(s.findall(sentence))

Option II (sets)

>>> s = set(big_list)
>>> len([word for word in sentence.split(" ") if word in s]) # O(1) avg lookup time

Example: if the list is ["cat","dog","knee"] and the text is "the dog jumped over the cat, but the dog broke his knee" the final result should be: 4

P.S. Any other option is welcome
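For reference, here is a runnable comparison of the two options on the example above. Note that they already disagree on this input: the set version splits on spaces only, so the token `"cat,"` (with its trailing comma) never matches.

```python
import re

big_list = ["cat", "dog", "knee"]
sentence = "the dog jumped over the cat, but the dog broke his knee"

# Option I: regex alternation (escaped, in case words contain metacharacters)
pattern = re.compile("|".join(map(re.escape, big_list)))
print(len(pattern.findall(sentence)))  # 4

# Option II: set lookup on whitespace-split tokens
s = set(big_list)
print(len([word for word in sentence.split(" ") if word in s]))  # 3 -- "cat," keeps its comma
```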

luke14free
  • Note that your two options will return different results even on your test data. (the `set` option won't pick up `'cat,'` whereas the regex will). – mgilson Apr 29 '13 at 00:14
  • [Aho–Corasick](http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) is fast and clever, but I don't know of a Python implementation. – msw Apr 29 '13 at 00:15
  • @msw: There's a link to one at the bottom of the article: http://papercruncher.com/2012/02/26/string-searching-using-aho-corasick/ – Blender Apr 29 '13 at 00:18
  • I might use the `set` version but iterate over `re.finditer(r'\b\w+\b',sentence)` (but this is dependent on your list of words being alphanumeric). And I might use the old standby `sum( 1 for x in ... if ... )` rather than using `len`. – mgilson Apr 29 '13 at 00:18

2 Answers


If your words are alphanumeric, I might use something like:

import re

s = set(big_list)
sum(1 for x in re.finditer(r'\b\w+\b', sentence) if x.group() in s)

Since the membership test for a set is on average O(1), this algorithm becomes O(N+M) where N is the number of words in the sentence and M is the number of elements in big_list. Not too shabby. It also does pretty well in terms of memory usage.
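As a quick sanity check, the expression does return 4 on the question's example:

```python
import re

big_list = ["cat", "dog", "knee"]
sentence = "the dog jumped over the cat, but the dog broke his knee"

s = set(big_list)
# \b\w+\b extracts alphanumeric words, so "cat," is seen as "cat"
count = sum(1 for x in re.finditer(r'\b\w+\b', sentence) if x.group() in s)
print(count)  # 4
```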

mgilson

A scalable method would be to sort both the input dictionary and the words from the text, then do the matching with two iterators walking the sorted lists in lockstep. You could also use a trie for even better performance. I don't know the internal representation of a set, but a large regex would be total overkill.
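A minimal sketch of the two-iterator idea (punctuation handling is left aside here, as in Option II, so the example sentence is given without the comma):

```python
big_list = ["cat", "dog", "knee"]
sentence = "the dog jumped over the cat but the dog broke his knee"

dictionary = sorted(set(big_list))
words = sorted(sentence.split())

# Merge-style walk over the two sorted lists
count = 0
i = j = 0
while i < len(dictionary) and j < len(words):
    if dictionary[i] == words[j]:
        count += 1
        j += 1          # advance only the text side, so duplicate words are counted
    elif dictionary[i] < words[j]:
        i += 1
    else:
        j += 1
print(count)  # 4
```

Sorting makes this O(N log N + M log M), which is asymptotically worse than the set-based answer, but the merge itself touches each list once.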

perreal