I'm using nltk to process text data. When I want to remove stopwords, I usually use this code:

text_clean = [w for w in text if w.lower() not in stopwords]

But this code always takes too long (maybe my data is too big...).
Is there any way to reduce the time? Thanks.

justin_sakong

1 Answer

Try converting stopwords to a set. With a list, your approach is O(n*m), where n is the number of words in text and m is the number of stop-words; with a set, the approach is O(n + m). Let's compare both approaches, list vs. set:

import timeit
from nltk.corpus import stopwords


def list_clean(text):
    # list membership test: every lookup scans the whole stop-word list
    stop_words = stopwords.words('english')
    return [w for w in text if w.lower() not in stop_words]


def set_clean(text):
    # set membership test: every lookup is a constant-time hash check
    set_stop_words = set(stopwords.words('english'))
    return [w for w in text if w.lower() not in set_stop_words]

text = ['the', 'cat', 'is', 'on', 'the', 'table', 'that', 'is', 'in', 'some', 'room'] * 100000

if __name__ == "__main__":
    print(timeit.timeit('list_clean(text)', 'from __main__ import text,list_clean', number=5))
    print(timeit.timeit('set_clean(text)', 'from __main__ import text,set_clean', number=5))

Output

7.6629380420199595
0.8327891009976156

In the code above, list_clean is a function that removes stopwords using a list and set_clean is a function that removes stopwords using a set. The first time corresponds to list_clean and the second to set_clean. For the given example, set_clean is almost 10 times faster.
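
Applied to the snippet in the question, the change is just to build the set once before filtering. A minimal sketch, assuming text is your list of words and that the nltk stopword corpus is already downloaded:

from nltk.corpus import stopwords

# Build the set once, outside the comprehension
stop_words = set(stopwords.words('english'))
text_clean = [w for w in text if w.lower() not in stop_words]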

UPDATE

O(n*m) and O(n + m) are examples of big O notation, a theoretical way of measuring the efficiency of algorithms. Basically, the faster the expression grows, the less efficient the algorithm; in this case O(n*m) grows faster than O(n + m), so the list_clean method is theoretically less efficient than the set_clean method. These numbers come from the fact that searching in a list is O(n), while searching in a set takes a constant amount of time, often referred to as O(1).
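
To see that last claim directly, here is a minimal sketch (not part of the original answer) that times a single membership test against a list and against a set built from the same hypothetical data:

import timeit

# 100,000 integers stored both ways
data_list = list(range(100000))
data_set = set(data_list)

# Looking up a value near the end of the list forces a near-full scan (O(n)),
# while the set resolves the same lookup by hashing (O(1) on average).
print(timeit.timeit(lambda: 99999 in data_list, number=1000))
print(timeit.timeit(lambda: 99999 in data_set, number=1000))

On a typical machine the list lookup should come out orders of magnitude slower, which is exactly the gap the O(n) vs O(1) argument predicts.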

Dani Mesejo