Try converting stopwords to a set. With a list, your approach is O(n*m), where n is the number of words in the text and m is the number of stop words; with a set, the approach is O(n + m). Let's compare both approaches, list vs. set:
import timeit

from nltk.corpus import stopwords

def list_clean(text):
    # Membership test against a list: O(m) per word.
    stop_words = stopwords.words('english')
    return [w for w in text if w.lower() not in stop_words]

def set_clean(text):
    # Membership test against a set: O(1) on average per word.
    set_stop_words = set(stopwords.words('english'))
    return [w for w in text if w.lower() not in set_stop_words]

text = ['the', 'cat', 'is', 'on', 'the', 'table', 'that', 'is', 'in', 'some', 'room'] * 100000

if __name__ == "__main__":
    print(timeit.timeit('list_clean(text)', 'from __main__ import text, list_clean', number=5))
    print(timeit.timeit('set_clean(text)', 'from __main__ import text, set_clean', number=5))
Output
7.6629380420199595
0.8327891009976156
In the code above, list_clean is a function that removes stopwords using a list, and set_clean is a function that removes stopwords using a set. The first time corresponds to list_clean and the second to set_clean. For the given example, set_clean is almost 10 times faster.
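As a quick sanity check (a minimal sketch, assuming the two functions above are defined in the same module), both approaches produce exactly the same result, so the speedup comes for free:

sample = ['The', 'cat', 'is', 'on', 'the', 'table']

# Both functions drop the same stop words; only the lookup cost differs.
assert list_clean(sample) == set_clean(sample)
print(set_clean(sample))  # expected: ['cat', 'table']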
UPDATE
O(n*m) and O(n + m) are examples of big O notation, a theoretical way of measuring the efficiency of algorithms. Roughly speaking, the faster the expression grows with the input size, the less efficient the algorithm; in this case O(n*m) grows faster than O(n + m), so the list_clean method is theoretically less efficient than the set_clean method. These numbers come from the fact that searching in a list is O(n), while searching in a set takes a constant amount of time, often referred to as O(1).
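To see the O(n) vs. O(1) lookup difference in isolation (a minimal sketch that only times membership tests, independent of NLTK), you can compare x in list against x in set directly:

import timeit

words_list = [str(i) for i in range(10000)]
words_set = set(words_list)

# Worst case: the element is not present, so the list is scanned end to end,
# while the set resolves the lookup with a hash probe.
print(timeit.timeit("'missing' in words_list", globals=globals(), number=1000))
print(timeit.timeit("'missing' in words_set", globals=globals(), number=1000))

On a typical machine the list lookup is orders of magnitude slower, which is exactly the gap you see between list_clean and set_clean above.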