
I have a list of lists of tokenized texts. The list contains around 1,200,000 texts. An example of this list is shown below:

texts = [
        ['hi', 'how', 'are', 'you'],
        ['i', 'am', 'fine', 'thank', 'you'],
        ...
]

I'm trying to remove from each list the words that appear in another list. That list contains around 90,000 words and looks like the following:

removing_words = ['ok', 'bye', 'hi', ...]

My code to do this is:

texts = [[token for token in text if token not in removing_words] for text in texts]

It works fine, but it is very, very slow. Any idea of how I can improve this? Thank you so much!

varzor23

1 Answer


I would look at how the tokens are generated. Try creating a dictionary of all tokens and their frequencies: the dict keeps a count of how many times each token appears, so its keys are unique.

#### PASS 1 - create a frequency dictionary
from collections import defaultdict

FreqDict = defaultdict(int)
for tList in texts:
    for token in tList:
        FreqDict[token] += 1
print(FreqDict)  # caution: very large output for 1,200,000 texts

#### PASS 2 - keep only tokens that appear exactly once
newtexts = [[token for token in tList if FreqDict[token] == 1] for tList in texts]
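
For the goal as originally posed (dropping a fixed list of words rather than all repeated tokens), the main cost in the question's code is the token not in removing_words test, which scans the 90,000-element list once per token. A minimal sketch of the usual fix, assuming removing_words fits in memory: build a set once, since set membership tests are O(1) on average (removing_set is an illustrative name, not from the original code).

#### Same filtering as the question's code, but against a set
removing_set = set(removing_words)  # built once: O(len(removing_words))
texts = [[token for token in text if token not in removing_set]  # O(1) lookups
         for text in texts]

The list comprehension is unchanged; only the container being probed differs, which is typically the difference between hours and seconds at this scale.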
frankr6591