60

I am trying to remove stopwords from a string of text:

from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])

I am processing 6 million of these strings, so speed matters. Profiling my code shows that the slowest part is the lines above. Is there a better way to do this? I'm thinking of using something like regex's re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster, methods.

Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but that made no difference.

Thank you.

mchangun

6 Answers

120

Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.

    from nltk.corpus import stopwords

    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in xrange(10000):
            testFuncOld()
            testFuncNew()

I ran this through the profiler: `python -m cProfile -s cumulative test.py`. The relevant lines are posted below.

    ncalls  cumtime  filename:lineno(function)
    10000   7.723    words.py:7(testFuncOld)
    10000   0.140    words.py:11(testFuncNew)

So caching the stopwords instance gives roughly a 55x speedup (7.723 s vs. 0.140 s cumulative time).
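
Combining the cached list with a set, as the comments below suggest, gives both benefits at once (no repeated disk reads plus O(1) membership tests). A minimal sketch of that variant:

    from nltk.corpus import stopwords

    # Load the stopword list once and convert it to a set for O(1) lookups.
    cachedStopWords = set(stopwords.words("english"))

    def remove_stopwords(text):
        return ' '.join(word for word in text.split() if word not in cachedStopWords)

    print(remove_stopwords('hello bye the the hi'))   # -> 'hello bye hi'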

Andy Rimmer
  • Agreed. The performance boost comes from caching the stopwords, not really from creating a `set`. – mchangun Oct 24 '13 at 11:41
  • 12
    Certainly you get a dramatic boost from not having to read the list from disk every time, because that's the most time-consuming operation. But if you now turn your "cached" list into a set (just once, of course), you'll get another boost. – alexis May 09 '15 at 13:47
  • Can anyone tell me if this supports Japanese? – Jay Nirgudkar Mar 23 '16 at 08:21
  • It gives me `UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal` on `text = ' '.join([word for word in text.split() if word not in stop_words])`. Please, Salomone, provide me a solution to this. – Abdul Rehman Janjua Aug 16 '16 at 14:43
31

Sorry for the late reply; this may prove useful for new users.

  • Create a dictionary of stopwords using the collections library
  • Use that dictionary for very fast lookups (O(1)) instead of searching the list (O(number of stopwords))

    from collections import Counter
    from nltk.corpus import stopwords

    stop_words = stopwords.words('english')
    stopwords_dict = Counter(stop_words)  # dict-like, so membership tests are O(1)
    text = ' '.join([word for word in text.split() if word not in stopwords_dict])
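
As a rough illustration of the O(1) claim, here is a minimal timing sketch (my addition, assuming NLTK's English stopword data is already downloaded); note that a plain set gives the same constant-time membership test:

    import timeit
    from collections import Counter
    from nltk.corpus import stopwords

    stop_list = stopwords.words('english')     # list: O(n) membership test
    stop_counter = Counter(stop_list)          # dict-like: O(1) membership test
    stop_set = set(stop_list)                  # set: O(1) membership test

    text = ' '.join(['hello bye the the hi'] * 1000)

    for name, container in [('list', stop_list), ('Counter', stop_counter), ('set', stop_set)]:
        t = timeit.timeit(
            lambda: ' '.join(w for w in text.split() if w not in container),
            number=100)
        print(name, round(t, 3))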
    
Gulshan Jangid
  • This does indeed speed things up considerably, even in comparison to the regexp-based approach. – Diego Sep 30 '19 at 13:57
  • 1
    This was indeed a great answer and I wish it were higher up. It's incredible how fast this was when removing words from text against a list of 20k items. The regular way took more than 1 hour, while using Counter took 20 seconds. – mrbTT Nov 22 '19 at 16:28
  • Can you explain how 'Counter' speeds up the process? @Gulshan Jangid – Karan Bari Dec 14 '19 at 13:32
  • 3
    Well, the main reason the above code is fast is that we are searching in a dictionary, which is basically a hashmap, and hashmap lookups take O(1) time. Beyond that, Counter is part of the collections library, which is written in C, and since C is much faster than Python, Counter is faster than similar code written in Python. – Gulshan Jangid Dec 15 '19 at 16:20
  • Just tested this and it's in average 3x faster than the regexp approach. A simple yet creative solution, the current best by far. – Julio Cezar Silva Sep 16 '20 at 01:36
  • 3
    Using this (`collections.Counter(stopwords.words('english'))`) can't be faster than using `set(stopwords.words('english'))`, I believe. The collections.Counter method just uses more memory unnecessarily. – mikey Jun 16 '21 at 12:56
25

Use a regexp to remove every word that matches a stopword:

import re
from nltk.corpus import stopwords

pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)

This will probably be way faster than looping yourself, especially for large input strings.

If the last word in the text gets deleted by this, you may have trailing whitespace. I propose to handle this separately.
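
One caveat worth adding (my note, not part of the original answer): if a stopword ever contained regex metacharacters, or if you want case-insensitive matching and no trailing whitespace, the pattern can be hardened along these lines:

    import re
    from nltk.corpus import stopwords

    # Escape each stopword defensively, match case-insensitively,
    # and strip whatever whitespace is left over at the ends.
    stop_pattern = re.compile(
        r'\b(' + '|'.join(re.escape(w) for w in stopwords.words('english')) + r')\b\s*',
        flags=re.IGNORECASE)

    text = stop_pattern.sub('', 'hello bye The the hi').strip()
    # -> 'hello bye hi'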

Alfe
  • Any idea what the complexity of this would be? If w = number of words in my text and s = number of words in the stop list, I think looping would be on the order of `w log s`. In this case, w is approx s so it's `w log w`. Wouldn't grep be slower since it (roughly) has to match character by character? – mchangun Oct 24 '13 at 11:40
  • 3
    Actually I think the complexities in the meaning of O(…) are the same. Both are `O(w log s)`, yes. **BUT** regexps are implemented on a much lower level and optimized heavily. Already the splitting of words will lead to copying everything, creating a list of strings, and the list itself, all that takes precious time. – Alfe Oct 24 '13 at 12:08
  • This approach is *much* faster than splitting lines, word tokenizing, then checking each word in a stopwords set. Particularly for larger text inputs – Bobs Burgers Aug 25 '20 at 13:38
7

First, you're creating the stop-word list for every string; create it once. A set would be great here, indeed:

forbidden_words = set(stopwords.words('english'))

Later, get rid of the [] inside join and use a generator instead.

Replace

' '.join([x for x in ['a', 'b', 'c']])

with

' '.join(x for x in ['a', 'b', 'c'])

The next thing to deal with would be making .split() yield values instead of returning a list; I believe a regex would be a good replacement here. See this thread for why s.split() is actually fast.

Lastly, do such a job in parallel (removing stop words from 6 million strings). That is a whole different topic.
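
A minimal sketch of that parallel approach (my illustration, not from the original answer), assuming the 6 million strings fit in a Python list and that NLTK's stopword data is downloaded:

    import multiprocessing as mp
    from nltk.corpus import stopwords

    forbidden_words = set(stopwords.words('english'))

    def remove_stopwords(text):
        # Split, drop stop words, and rebuild the string.
        return ' '.join(w for w in text.split() if w not in forbidden_words)

    if __name__ == '__main__':
        texts = ['hello bye the the hi'] * 1000   # stand-in for the real 6M strings
        with mp.Pool() as pool:
            cleaned = pool.map(remove_stopwords, texts, chunksize=1000)
        print(cleaned[0])   # -> 'hello bye hi'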

Krzysztof Szularz
  • 1
    I doubt using regexp gonna be an improvement, see http://stackoverflow.com/questions/7501609/python-re-split-vs-split/7501659#7501659 – alko Oct 24 '13 at 08:34
  • Found it just now as well. :) – Krzysztof Szularz Oct 24 '13 at 08:38
  • 1
    Thanks. The `set` made at least an 8x improvement to speed. Why does using a generator help? RAM isn't an issue for me because each piece of text is quite small, about 100-200 words. – mchangun Oct 24 '13 at 08:38
  • 2
    Actually, I've seen `join` perform better with a list comprehension than the equivalent generator expression. – Janne Karila Oct 24 '13 at 08:42
  • 1
    Set difference seems to work too `clean_text = set(text.lower().split()) - set(stopwords.words('english'))` – wmik Oct 25 '19 at 15:50
2

Avoid looping and instead use a regex to remove the stopwords:

import re
from nltk.corpus import stopwords

cachedStopWords = stopwords.words("english")
pattern = re.compile(r'\b(' + r'|'.join(cachedStopWords) + r')\b\s*')
text = pattern.sub('', text)
Anurag Dhadse
0

Using just a regular dict seems to be the fastest solution by far, surpassing even the Counter solution by about 10%.

from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'
text = " ".join([word for word in text.split() if word not in stopwords_dict])

Tested using the cProfile profiler

You can find the test code used here: https://gist.github.com/maxandron/3c276924242e7d29d9cf980da0a8a682

EDIT:

On top of that, if we replace the list comprehension with a loop, we get another 20% increase in performance:

from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'

new = ""
for word in text.split():
    if word not in stopwords_dict:
        new += word + " "   # keep the kept words separated by a space
text = new.strip()
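
As a side note (my addition): repeated string concatenation can degrade badly for very long strings, so a common alternative is to append the kept words to a list and join once at the end. A sketch:

    from nltk.corpus import stopwords

    stopwords_dict = {word: 1 for word in stopwords.words("english")}
    text = 'hello bye the the hi'

    kept = []                        # collect kept words, then join once
    for word in text.split():
        if word not in stopwords_dict:
            kept.append(word)
    text = ' '.join(kept)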
maxandron