Suppose I have a very large list of words and need to count how many times any of those words occur in a given piece of text. Which is the best option in terms of scalability?
Option I (regex)
>>> import re
>>> s = re.compile("|".join(big_list))
>>> len(s.findall(sentence))  # compiled patterns have findall, not find_all
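For reference, a more robust sketch of Option I that escapes each word with re.escape and anchors on word boundaries, so e.g. "cat" doesn't also match inside "catalog" (assuming big_list holds plain words, not regex patterns):
>>> import re
>>> pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, big_list)) + r")\b")  # one alternation over all words
>>> len(pattern.findall(sentence))  # counts every non-overlapping match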
Option II (sets)
>>> s = set(big_list)
>>> len([word for word in sentence.split(" ") if word in s]) # O(1) avg lookup time
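One caveat with Option II: splitting on a single space leaves punctuation attached, so "cat," would not match "cat" and the example below would give 3 instead of 4. A sketch that tokenizes on word characters instead (same big_list and sentence as above):
>>> import re
>>> s = set(big_list)
>>> tokens = re.findall(r"\w+", sentence)  # strips punctuation while splitting
>>> sum(1 for word in tokens if word in s)  # avoids building an intermediate list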
Example: if the list is ["cat", "dog", "knee"] and the text is "the dog jumped over the cat, but the dog broke his knee", the final result should be 4.
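For concreteness, the set-based sketch above run on exactly that input:
>>> import re
>>> big_list = ["cat", "dog", "knee"]
>>> sentence = "the dog jumped over the cat, but the dog broke his knee"
>>> s = set(big_list)
>>> sum(1 for word in re.findall(r"\w+", sentence) if word in s)
4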
P.S. Any other options are welcome.