60

I am trying to remove stopwords from a string of text:

from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])

I am processing 6 million of these strings, so speed matters. Profiling my code shows that the slowest part is the lines above. Is there a better way to do this? I'm thinking of using something like regex's re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster, methods.

Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but that made no difference.

Thank you.

mchangun

6 Answers

120

Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.

    from nltk.corpus import stopwords

    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in xrange(10000):
            testFuncOld()
            testFuncNew()

I ran this through the profiler: `python -m cProfile -s cumulative test.py`. The relevant lines are posted below.

    ncalls  cumtime  filename:lineno(function)
    10000   7.723    words.py:7(testFuncOld)
    10000   0.140    words.py:11(testFuncNew)

So caching the stopwords instance gives roughly a 55x speedup (7.723 s vs. 0.140 s cumulative time).
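
Combining the cached list with a set, as the comments below suggest, gives both benefits at once (no repeated disk reads plus O(1) membership tests). A minimal sketch of that variant:

    from nltk.corpus import stopwords

    # Load the stopword list once and convert it to a set for O(1) lookups.
    cachedStopWords = set(stopwords.words("english"))

    def remove_stopwords(text):
        return ' '.join(word for word in text.split() if word not in cachedStopWords)

    print(remove_stopwords('hello bye the the hi'))   # -> 'hello bye hi'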

Andy Rimmer
  • Agreed. The performance boost comes from caching the stopwords, not really from creating a `set`. – mchangun Oct 24 '13 at 11:41
  • 12
    Certainly you get a dramatic boost from not having to read the list from disk every time, because that's the most time-consuming operation. But if you now turn your "cached" list into a set (just once, of course), you'll get another boost. – alexis May 09 '15 at 13:47
  • Can anyone tell me if this supports Japanese? – Jay Nirgudkar Mar 23 '16 at 08:21
  • It gives me `UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal` on `text = ' '.join([word for word in text.split() if word not in stop_words])`. Please, Salomone, provide me a solution to this. – Abdul Rehman Janjua Aug 16 '16 at 14:43
31

Sorry for the late reply; this may prove useful for new users.

  • Create a dictionary of stopwords using the collections library
  • Use that dictionary for very fast lookups (O(1)) instead of searching the list (O(number of stopwords))

    from collections import Counter
    from nltk.corpus import stopwords

    stop_words = stopwords.words('english')
    stopwords_dict = Counter(stop_words)  # dict-like, so membership tests are O(1)
    text = ' '.join([word for word in text.split() if word not in stopwords_dict])
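
As a rough illustration of the O(1) claim, here is a minimal timing sketch (my addition, assuming NLTK's English stopword data is already downloaded); note that a plain set gives the same constant-time membership test:

    import timeit
    from collections import Counter
    from nltk.corpus import stopwords

    stop_list = stopwords.words('english')     # list: O(n) membership test
    stop_counter = Counter(stop_list)          # dict-like: O(1) membership test
    stop_set = set(stop_list)                  # set: O(1) membership test

    text = ' '.join(['hello bye the the hi'] * 1000)

    for name, container in [('list', stop_list), ('Counter', stop_counter), ('set', stop_set)]:
        t = timeit.timeit(
            lambda: ' '.join(w for w in text.split() if w not in container),
            number=100)
        print(name, round(t, 3))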
    
Gulshan Jangid
  • This does indeed speed things up considerably, even in comparison to the regexp-based approach. – Diego Sep 30 '19 at 13:57
  • 1
    This was indeed a great answer and I wish it were higher up. It's incredible how fast this was when removing words from text against a list of 20k items. The regular way took more than 1 hour, while using Counter took 20 seconds. – mrbTT Nov 22 '19 at 16:28
  • Can you explain how 'Counter' speeds up the process? @Gulshan Jangid – Karan Bari Dec 14 '19 at 13:32
  • 3
    Well, the main reason the above code is fast is that we are searching in a dictionary, which is basically a hashmap, and hashmap lookups take O(1) time. Beyond that, Counter is part of the collections library, which is written in C, and since C is much faster than Python, Counter is faster than similar code written in Python. – Gulshan Jangid Dec 15 '19 at 16:20
  • Just tested this and it's in average 3x faster than the regexp approach. A simple yet creative solution, the current best by far. – Julio Cezar Silva Sep 16 '20 at 01:36
  • 3
    Using this (`collections.Counter(stopwords.words('english'))`) can't be faster than using `set(stopwords.words('english'))`, I believe. The collections.Counter method just uses more memory unnecessarily. – mikey Jun 16 '21 at 12:56
25

Use a regexp to remove every word that matches a stopword:

import re
from nltk.corpus import stopwords

pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)

This will probably be way faster than looping yourself, especially for large input strings.

If the last word in the text gets deleted by this, you may have trailing whitespace. I propose to handle this separately.
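
One caveat worth adding (my note, not part of the original answer): if a stopword ever contained regex metacharacters, or if you want case-insensitive matching and no trailing whitespace, the pattern can be hardened along these lines:

    import re
    from nltk.corpus import stopwords

    # Escape each stopword defensively, match case-insensitively,
    # and strip whatever whitespace is left over at the ends.
    stop_pattern = re.compile(
        r'\b(' + '|'.join(re.escape(w) for w in stopwords.words('english')) + r')\b\s*',
        flags=re.IGNORECASE)

    text = stop_pattern.sub('', 'hello bye The the hi').strip()
    # -> 'hello bye hi'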

Alfe
  • Any idea what the complexity of this would be? If w = number of words in my text and s = number of words in the stop list, I think looping would be on the order of `w log s`. In this case, w is approx s so it's `w log w`. Wouldn't grep be slower since it (roughly) has to match character by character? – mchangun Oct 24 '13 at 11:40
  • 3
    Actually I think the complexities in the meaning of O(…) are the same. Both are `O(w log s)`, yes. **BUT** regexps are implemented on a much lower level and optimized heavily. Already the splitting of words will lead to copying everything, creating a list of strings, and the list itself, all that takes precious time. – Alfe Oct 24 '13 at 12:08
  • This approach is *much* faster than splitting lines, word tokenizing, then checking each word in a stopwords set. Particularly for larger text inputs – Bobs Burgers Aug 25 '20 at 13:38
7

First, you're creating the stop-word list for every string; create it once. A set would be great here, indeed:

forbidden_words = set(stopwords.words('english'))

Later, get rid of the [] inside join and use a generator instead.

Replace

' '.join([x for x in ['a', 'b', 'c']])

with

' '.join(x for x in ['a', 'b', 'c'])

The next thing to deal with would be making .split() yield values instead of returning a list; I believe a regex would be a good replacement here. See this thread for why s.split() is actually fast.

Lastly, do such a job in parallel (removing stop words from 6 million strings). That is a whole different topic.
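
A minimal sketch of that parallel approach (my illustration, not from the original answer), assuming the 6 million strings fit in a Python list and that NLTK's stopword data is downloaded:

    import multiprocessing as mp
    from nltk.corpus import stopwords

    forbidden_words = set(stopwords.words('english'))

    def remove_stopwords(text):
        # Split, drop stop words, and rebuild the string.
        return ' '.join(w for w in text.split() if w not in forbidden_words)

    if __name__ == '__main__':
        texts = ['hello bye the the hi'] * 1000   # stand-in for the real 6M strings
        with mp.Pool() as pool:
            cleaned = pool.map(remove_stopwords, texts, chunksize=1000)
        print(cleaned[0])   # -> 'hello bye hi'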

Krzysztof Szularz
  • 1
    I doubt using regexp gonna be an improvement, see http://stackoverflow.com/questions/7501609/python-re-split-vs-split/7501659#7501659 – alko Oct 24 '13 at 08:34
  • Found it just now as well. :) – Krzysztof Szularz Oct 24 '13 at 08:38
  • 1
    Thanks. The `set` made at least an 8x improvement to speed. Why does using a generator help? RAM isn't an issue for me because each piece of text is quite small, about 100-200 words. – mchangun Oct 24 '13 at 08:38
  • 2
    Actually, I've seen `join` perform better with a list comprehension than the equivalent generator expression. – Janne Karila Oct 24 '13 at 08:42
  • 1
    Set difference seems to work too `clean_text = set(text.lower().split()) - set(stopwords.words('english'))` – wmik Oct 25 '19 at 15:50
2

Avoid looping and instead use a regex to remove the stopwords:

import re
from nltk.corpus import stopwords

cachedStopWords = stopwords.words("english")
pattern = re.compile(r'\b(' + r'|'.join(cachedStopWords) + r')\b\s*')
text = pattern.sub('', text)
Anurag Dhadse
0

Using just a regular dict seems to be the fastest solution by far, surpassing even the Counter solution by about 10%.

from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'
text = " ".join([word for word in text.split() if word not in stopwords_dict])

Tested using the cProfile profiler

You can find the test code used here: https://gist.github.com/maxandron/3c276924242e7d29d9cf980da0a8a682

EDIT:

On top of that, if we replace the list comprehension with a loop, we get another 20% increase in performance:

from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'

new = ""
for word in text.split():
    if word not in stopwords_dict:
        new += word + " "   # keep the kept words separated by a space
text = new.strip()
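
As a side note (my addition): repeated string concatenation can degrade badly for very long strings, so a common alternative is to append the kept words to a list and join once at the end. A sketch:

    from nltk.corpus import stopwords

    stopwords_dict = {word: 1 for word in stopwords.words("english")}
    text = 'hello bye the the hi'

    kept = []                        # collect kept words, then join once
    for word in text.split():
        if word not in stopwords_dict:
            kept.append(word)
    text = ' '.join(kept)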
maxandron