I'm using nltk to process text data. When I want to remove stopwords, I usually use this code:

text_clean = [w for w in text if w.lower() not in stopwords]

But this code always takes too long (maybe my data is too big...).
Is there any way to reduce the time? Thanks.

justin_sakong

1 Answer

Try converting stopwords to a set. With a list, your approach is O(n*m), where n is the number of words in text and m is the number of stop-words; with a set, the approach is O(n + m). Let's compare both approaches, list vs. set:

import timeit
from nltk.corpus import stopwords


def list_clean(text):
    # list membership test: every lookup scans the whole stop-word list
    stop_words = stopwords.words('english')
    return [w for w in text if w.lower() not in stop_words]


def set_clean(text):
    # set membership test: every lookup is a constant-time hash check
    set_stop_words = set(stopwords.words('english'))
    return [w for w in text if w.lower() not in set_stop_words]

text = ['the', 'cat', 'is', 'on', 'the', 'table', 'that', 'is', 'in', 'some', 'room'] * 100000

if __name__ == "__main__":
    print(timeit.timeit('list_clean(text)', 'from __main__ import text,list_clean', number=5))
    print(timeit.timeit('set_clean(text)', 'from __main__ import text,set_clean', number=5))

Output

7.6629380420199595
0.8327891009976156

In the code above, list_clean is a function that removes stopwords using a list and set_clean is a function that removes stopwords using a set. The first time corresponds to list_clean and the second to set_clean. For the given example, set_clean is almost 10 times faster.
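
Applied to the snippet in the question, the change is just to build the set once before filtering. A minimal sketch, assuming text is your list of words and that the nltk stopword corpus is already downloaded:

from nltk.corpus import stopwords

# Build the set once, outside the comprehension
stop_words = set(stopwords.words('english'))
text_clean = [w for w in text if w.lower() not in stop_words]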

UPDATE

O(n*m) and O(n + m) are examples of big O notation, a theoretical way of measuring the efficiency of algorithms. Basically, the faster the expression grows, the less efficient the algorithm; in this case O(n*m) grows faster than O(n + m), so the list_clean method is theoretically less efficient than the set_clean method. These numbers come from the fact that searching in a list is O(n), while searching in a set takes a constant amount of time, often referred to as O(1).
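
To see that last claim directly, here is a minimal sketch (not part of the original answer) that times a single membership test against a list and against a set built from the same hypothetical data:

import timeit

# 100,000 integers stored both ways
data_list = list(range(100000))
data_set = set(data_list)

# Looking up a value near the end of the list forces a near-full scan (O(n)),
# while the set resolves the same lookup by hashing (O(1) on average).
print(timeit.timeit(lambda: 99999 in data_list, number=1000))
print(timeit.timeit(lambda: 99999 in data_set, number=1000))

On a typical machine the list lookup should come out orders of magnitude slower, which is exactly the gap the O(n) vs O(1) argument predicts.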

Dani Mesejo