I'm processing a TextBlob and one of the steps is stopword removal. TextBlobs are immutable, so I'm turning one into a list to do the job:

from textblob import TextBlob as tb
from nltk.corpus import stopwords

blob = tb(tekst)
lista = [word for word in blob.words if word not in stopwords.words('english')]
tekst = ' '.join(lista)
blob = tb(tekst)

Is there a simpler / more elegant solution for the problem?

Zygmunt
  • 1
    Check out nltk... A similar question has already been answered: https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python – Christian Will Oct 26 '17 at 16:36
  • 1
    Your code seems perfectly fine as far as removing stop words is concerned. That's kind of the standard way. – utengr Oct 26 '17 at 16:41
  • 2
    Yeah going off of what utengr said, there's not really a more efficient way to do it. You're going to have to look at every word anyway. Only thing that you could do to make it more efficient is not actually construct the list and use a generator. Just change your `[]`s to `()` as in: `(word for word in blob.words if word not in stopwords.words('english'))`. You'll never be able to access the list again after you use it but you join it right away anyway. – Nick Chapman Oct 26 '17 at 16:49
  • Thanks everybody! Christian Will: valid point, but I wanted to avoid using nltk. Nick Chapman: sounds great - it is a single serving list indeed. If it is a more CPU-efficient solution it is the one I was looking for. – Zygmunt Oct 26 '17 at 17:20
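
The generator suggestion from the comments can be sketched as below. This is only an illustration under stated assumptions: `strip_stopwords` is a hypothetical helper name, and the word and stopword lists are stand-ins so the snippet runs without TextBlob or NLTK installed. It also builds the stopword collection once as a set, instead of calling `stopwords.words('english')` for every word.

```python
def strip_stopwords(words, stop_words):
    """Join the words that are not stopwords into a single string.

    Hypothetical helper: `words` can be any iterable of strings
    (for example blob.words), `stop_words` any iterable of stopwords.
    """
    # Build the lookup set once, outside the loop over the words.
    stop = frozenset(stop_words)
    # Generator expression: no intermediate list is bound to a name.
    return " ".join(w for w in words if w not in stop)


# Stand-in data (assumption); real input would come from TextBlob / NLTK.
sample_words = ["this", "is", "a", "noisy", "sentence"]
sample_stop = ["this", "is", "a"]

cleaned = strip_stopwords(sample_words, sample_stop)
print(cleaned)  # noisy sentence
```

With the names from the question, the call would presumably look like `blob = tb(strip_stopwords(blob.words, stopwords.words('english')))`.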

1 Answer

You can try this code:

from textblob import TextBlob
from nltk.corpus import stopwords

b = "Do not purchase these earphones. It will automatically disconnect and reconnect. Worst product to buy."
text = TextBlob(b)

# Tokens (a set, so duplicates and word order are discarded)
tokens = set(text.words)
print("Tokens:", tokens)

# Stopwords
stop = set(stopwords.words("english"))

# Remove stop words using the set difference operation
print("Filtered Tokens:", tokens - stop)

Output:

Tokens: {'buy', 'disconnect', 'will', 'to', 'purchase', 'reconnect', 'product', 'It', 'Do', 'and', 'Worst', 'earphones', 'not', 'automatically', 'these'}

Filtered Tokens: {'buy', 'disconnect', 'purchase', 'reconnect', 'product', 'It', 'Do', 'Worst', 'earphones', 'automatically'}
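
Note that a set difference discards word order and repeated words, and the comparison is case-sensitive, which is why 'It' and 'Do' survive in the output above. If the filtered text has to be turned back into a blob, a rough alternative is to keep a list and compare lowercased words. The snippet below is a sketch with stand-in data (both `words` and `stop` are assumptions, not the real NLTK list), so it runs without TextBlob or NLTK.

```python
# Stand-in tokens and stopwords (assumptions for illustration only).
words = ["Worst", "product", "to", "buy", "It", "will", "buy"]
stop = {"to", "it", "will"}

# Set difference: order and the repeated 'buy' are lost, and 'It' is kept
# because it does not match the lowercase stopword 'it'.
as_set = set(words) - stop

# List comprehension with a lowercase comparison: order and duplicates are
# preserved, and capitalised stopwords are removed as well.
ordered = [w for w in words if w.lower() not in stop]

print(sorted(as_set))     # ['It', 'Worst', 'buy', 'product']
print(" ".join(ordered))  # Worst product buy buy
```

Whether lowercasing is desirable depends on the task; it is shown here only as one possible choice.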

Avinash