DISCLAIMER: I'm the author of trrex.
For exact matching, one approach is to use a Trie, as mentioned in the comments. trrex is a library that builds a Trie-Regex (a Trie in regular-expression format) that can be used with Python's regular expression engine:
import random
import re

import pandas as pd
import trrex as tx

df = pd.read_csv('jeopardy-small.csv')
with open('words-sample') as infile:
    words = [line.strip() for line in infile]

tuples = [(random.randint(1, 250), sentence) for sentence in df['question']]

def fun_kislyuk(ws, ts):
    # baseline: plain substring check of every word against every sentence
    return {t[0] for t in ts if any(w in t[1] for w in ws)}

def fun_trrex(ws, ts):
    # compile one trie-based regex covering all the words at once
    pattern = re.compile(tx.make(ws, left='', right=''))
    return {i for i, s in ts if pattern.search(s)}

if __name__ == "__main__":
    print(fun_trrex(words, tuples) == fun_kislyuk(words, tuples))
Output
True
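To give an intuition for why this is faster: the pattern trrex builds collapses the common prefixes of the words, so the regex engine does not have to try every word of a long alternation at every position. A minimal sketch (the actual pattern string is an internal detail of trrex; conceptually it looks like fo(?:od|x) rather than food|fox):

import re
import trrex as tx

# Two words sharing the prefix 'fo' are merged into a single branch,
# so the shared characters are examined only once per starting position.
pattern = re.compile(tx.make(['fox', 'food'], left='', right=''))
print(bool(pattern.search('the quick brown fox')))  # True
print(bool(pattern.search('nothing to see here')))  # False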
The timings for the above functions are:
%timeit fun_trrex(words, tuples)
11.3 ms ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun_kislyuk(words, tuples)
67.5 ms ± 1.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The data is a list of around 2K questions from Jeopardy and 500 randomly chosen words. You can find the resources for reproducing the experiments here.
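If you don't have those files at hand, a synthetic stand-in is enough to reproduce the shape of the experiment (hypothetical data, not the actual Jeopardy questions or word sample):

import random

# hypothetical substitutes for 'jeopardy-small.csv' and 'words-sample'
vocab = [f'word{i}' for i in range(5000)]
questions = [' '.join(random.choices(vocab, k=20)) for _ in range(2000)]  # ~2K sentences
words = random.sample(vocab, 500)                                         # 500 search words
tuples = [(random.randint(1, 250), q) for q in questions]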
UPDATE
If you add the grouping strategy mentioned in the comments, the improvement increases further: grouping the sentences by id means the search can stop for a group as soon as one of its sentences matches. Below is the function:
from collections import defaultdict

def fun_grouping_trrex(ws, ts):
    pattern = re.compile(tx.make(ws, left='', right=''))
    # group sentences by id, so each id is settled by its first matching sentence
    groups = defaultdict(list)
    for i, s in ts:
        groups[i].append(s)
    return {i for i, vs in groups.items() if any(pattern.search(v) for v in vs)}
and the timings:
%timeit fun_trrex(words, tuples)
11.2 ms ± 61.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun_grouping_trrex(words, tuples)
4.96 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit fun_kislyuk(words, tuples)
67.4 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The grouping + trrex approach gives roughly a 10x performance improvement, but take this last result with a grain of salt, because it is highly dependent on the dataset.
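One detail worth noting: passing left='' and right='' overrides trrex's default boundary anchors (word boundaries, \b), so the compiled regex matches substrings, mirroring the w in t[1] check in fun_kislyuk. If you want whole-word matching instead, keep the defaults; a small sketch:

import re
import trrex as tx

# default boundaries (\b on both sides): whole words only
whole = re.compile(tx.make(['cat']))
print(bool(whole.search('a cat sat')))    # True
print(bool(whole.search('concatenate')))  # False

# empty boundaries: substring matching, like the `in` operator
sub = re.compile(tx.make(['cat'], left='', right='')) 
print(bool(sub.search('concatenate')))    # True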