
Suppose that I need to handle a very big list of words, and I need to count the number of times I find any of those words in a piece of text I have. Which is the best option in terms of scalability?

Option I (regex)

>>> import re
>>> s = re.compile("|".join(map(re.escape, big_list)))  # escape, in case any word contains regex metacharacters
>>> len(s.findall(sentence))

Option II (sets)

>>> s = set(big_list)
>>> len([word for word in sentence.split(" ") if word in s]) # O(1) avg lookup time

Example: if the list is ["cat","dog","knee"] and the text is "the dog jumped over the cat, but the dog broke his knee" the final result should be: 4

P.S. Any other option is welcome
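For reference, here is a runnable comparison of the two options on the example above. Note that they already disagree on this input: the set version splits on spaces only, so the token `"cat,"` (with its trailing comma) never matches.

```python
import re

big_list = ["cat", "dog", "knee"]
sentence = "the dog jumped over the cat, but the dog broke his knee"

# Option I: regex alternation (escaped, in case words contain metacharacters)
pattern = re.compile("|".join(map(re.escape, big_list)))
print(len(pattern.findall(sentence)))  # 4

# Option II: set lookup on whitespace-split tokens
s = set(big_list)
print(len([word for word in sentence.split(" ") if word in s]))  # 3 -- "cat," keeps its comma
```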

luke14free
  • Note that your two options will return different results even on your test data. (the `set` option won't pick up `'cat,'` whereas the regex will). – mgilson Apr 29 '13 at 00:14
  • [Aho–Corasick](http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) is fast and clever, but I don't know of a Python implementation. – msw Apr 29 '13 at 00:15
  • @msw: There's a link to one at the bottom of the article: http://papercruncher.com/2012/02/26/string-searching-using-aho-corasick/ – Blender Apr 29 '13 at 00:18
  • I might use the `set` version but iterate over `re.finditer(r'\b\w+\b',sentence)` (but this is dependent on your list of words being alphanumeric). And I might use the old standby `sum( 1 for x in ... if ... )` rather than using `len`. – mgilson Apr 29 '13 at 00:18

2 Answers


If your words are alphanumeric, I might use something like:

import re

s = set(big_list)
sum(1 for x in re.finditer(r'\b\w+\b', sentence) if x.group() in s)

Since the membership test for a set is on average O(1), this algorithm becomes O(N+M) where N is the number of words in the sentence and M is the number of elements in big_list. Not too shabby. It also does pretty well in terms of memory usage.
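As a quick sanity check, the expression does return 4 on the question's example:

```python
import re

big_list = ["cat", "dog", "knee"]
sentence = "the dog jumped over the cat, but the dog broke his knee"

s = set(big_list)
# \b\w+\b extracts alphanumeric words, so "cat," is seen as "cat"
count = sum(1 for x in re.finditer(r'\b\w+\b', sentence) if x.group() in s)
print(count)  # 4
```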

mgilson

A scalable method would be to sort both the input dictionary and the words from the text, then do the matching with two iterators walking the sorted lists in lockstep. You could also use a trie for even better performance. I don't know the internal representation of a set, but a large regex would be total overkill.
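A minimal sketch of the two-iterator idea (punctuation handling is left aside here, as in Option II, so the example sentence is given without the comma):

```python
big_list = ["cat", "dog", "knee"]
sentence = "the dog jumped over the cat but the dog broke his knee"

dictionary = sorted(set(big_list))
words = sorted(sentence.split())

# Merge-style walk over the two sorted lists
count = 0
i = j = 0
while i < len(dictionary) and j < len(words):
    if dictionary[i] == words[j]:
        count += 1
        j += 1          # advance only the text side, so duplicate words are counted
    elif dictionary[i] < words[j]:
        i += 1
    else:
        j += 1
print(count)  # 4
```

Sorting makes this O(N log N + M log M), which is asymptotically worse than the set-based answer, but the merge itself touches each list once.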

perreal