
My aim is to remove rare words from a dataframe with 3 million rows. The code below is taking a very long time. Is there a way I can optimize it?

rare_word = []
for k, v in frequency_word.items():
    if v <= 1:
        rare_word.append(k)

df['description'] = df['description'].apply(lambda x: [i for i in x if i not in rare_word])
sudojarvis
  • You can try listcomp to perform the first part (`rare_words = [k for k, v in frequency_word.items() if v <= 1]`). I doubt it will give noticeable benefit, but some - sure. You can also switch to execution in parallel, see [this question](https://stackoverflow.com/questions/45545110/make-pandas-dataframe-apply-use-all-cores). – STerliakov Jun 25 '22 at 22:46
  • How big is `rare_word` in practice? How big are the lists in the dataframe on average? What is a "lakh"? Is it https://en.wikipedia.org/wiki/Lakh ? Please consider using the international numbering standard. – Jérôme Richard Jun 25 '22 at 23:06
  • 30 lakh==3 million. – sudojarvis Jun 25 '22 at 23:16
  • @sudojarvis Ok, but what about the other points? – Jérôme Richard Jun 26 '22 at 00:10
  • Welcome to Stack Overflow. Please read [mre] and make sure the *structure* of the Dataframe is clear. Based on the code, I assume that *each cell* of the `'description'` column contains a *list of words*. Is that correct? Next, see what you can find out yourself about the performance of the code. For example, how long does building `rare_word` take? How long does the `.apply` take? About how many rare words are there? – Karl Knechtel Jun 26 '22 at 00:14
  • @KarlKnechtel the size of `rare_word` is 1334503 – sudojarvis Jun 26 '22 at 03:20
  • @JérômeRichard the length of `rare_word` is 1334503 – sudojarvis Jun 26 '22 at 03:21

1 Answer


Since `rare_word` is pretty big, the expression `i not in rare_word` will be slow because it performs a linear search over the list. You can speed this up by converting `rare_word` to a set with `rare_word = set(rare_word)`. Sets not only perform `not in` in constant time thanks to hashing, they also avoid expensive string comparisons (again thanks to hashing). You can use a set comprehension to build the set directly:

# Note the presence of the '{}'
rare_word = {k for k, v in frequency_word.items() if v <= 1}

It may be possible to optimize the code further, but it is hard to say without more information on the dataframe. At the very least, this optimization should speed up the code by several orders of magnitude.
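Putting it together, here is a minimal end-to-end sketch. It assumes, as in the comments above, that each cell of `df['description']` holds a list of tokens and that `frequency_word` maps each token to its count (the sample data here is hypothetical):

```python
import pandas as pd

# Hypothetical small example: each cell of 'description' is a list of tokens
df = pd.DataFrame({"description": [["apple", "banana", "apple"], ["banana", "cherry"]]})

# Count token frequencies across the whole column
frequency_word = {}
for tokens in df["description"]:
    for t in tokens:
        frequency_word[t] = frequency_word.get(t, 0) + 1

# Build the rare-word set with a set comprehension: membership tests are O(1)
rare_word = {k for k, v in frequency_word.items() if v <= 1}

# Filter each cell; 'not in' against a set is constant time per token
df["description"] = df["description"].apply(lambda x: [i for i in x if i not in rare_word])
```

With the list version, the `.apply` costs roughly O(rows × tokens-per-row × len(rare_word)); with the set it drops to O(rows × tokens-per-row), which is where the orders-of-magnitude speedup comes from.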

Jérôme Richard