There is a list of sentences, sentences = ['Ask the swordsmith', 'He knows everything']. The goal is to remove those sentences that contain a word from a wordlist, lexicon = ['word', 'every', 'thing']. This can be achieved using the following list comprehension:
newlist = [sentence for sentence in sentences if not any(word in sentence.split(' ') for word in lexicon)]
Note that if not word in sentence is not a sufficient condition, as it would also remove sentences that contain words in which a word from the lexicon is embedded, e.g. word is embedded in swordsmith, and every and thing are embedded in everything.
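To make the difference concrete, here is a small sketch that runs both conditions on the example data. The names by_substring and by_token are only illustrative.

```python
sentences = ['Ask the swordsmith', 'He knows everything']
lexicon = ['word', 'every', 'thing']

# Substring test: 'word' is found inside 'swordsmith', and 'every'/'thing'
# inside 'everything', so both sentences are (wrongly) removed.
by_substring = [s for s in sentences if not any(w in s for w in lexicon)]

# Whole-word test: a lexicon word must equal one of the space-separated tokens.
by_token = [s for s in sentences if not any(w in s.split(' ') for w in lexicon)]

print(by_substring)  # []
print(by_token)      # ['Ask the swordsmith', 'He knows everything']
```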
However, my list consists of 1,000,000 sentences and my lexicon of 200,000 words. Applying the list comprehension above takes hours! I'm therefore looking for a faster method to remove strings from a list that contain words from another list. Any suggestions? Maybe using regex?
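For reference, one possible direction is sketched below. It assumes words are separated by single spaces and that matching is case-sensitive, just as in the comprehension above. The idea is to turn the lexicon into a set once, so that each sentence is split a single time and its tokens are checked with average constant-time lookups instead of scanning all lexicon entries. The function name remove_matching and the third example sentence are hypothetical, and this has not been timed on data of the size described.

```python
def remove_matching(sentences, lexicon):
    """Keep only sentences that share no whole word with the lexicon."""
    lexicon_set = set(lexicon)  # built once; membership tests are O(1) on average
    # isdisjoint returns True when no token of the sentence is in the set.
    return [s for s in sentences if lexicon_set.isdisjoint(s.split(' '))]


sentences = ['Ask the swordsmith', 'He knows everything', 'Every word counts']
lexicon = ['word', 'every', 'thing']

newlist = remove_matching(sentences, lexicon)
print(newlist)  # ['Ask the swordsmith', 'He knows everything']
```

The third sentence is dropped because it contains the token word. 'Every' does not match 'every' since the comparison is case-sensitive.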