I am trying to remove stopwords from my text.

I have tried using the code below.

import re
from nltk.corpus import stopwords
sw = stopwords.words("english")
my_text = 'I love coding'
my_text = re.sub("|".join(sw), "", my_text)
print(my_text)

Expected result: love coding. Actual result: I l cng (since 'o' and 've' are both found in the stopwords list "sw").

How can I get the expected result?

threxx
    https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python possible duplication... – Aldric Jul 25 '19 at 14:58

2 Answers

You need to replace whole words, not characters:

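from itertools import filterfalse
from nltk.corpus import stopwords
sw = set(stopwords.words("english"))  # set membership tests are faster than list lookups
my_text = 'I love coding'
my_words = my_text.split()  # naive whitespace split into words
# drop every word that appears in the stopword set (the comparison is case-sensitive)
no_stopwords = ' '.join(filterfalse(sw.__contains__, my_words))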

You should also handle sentence splitting, case sensitivity, and similar details.

This is a common, non-trivial problem, so there are libraries that do it properly.
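
If you would rather keep the regex approach from the question, a minimal sketch is to anchor the pattern on word boundaries so that only whole words match. The short sw list below is a hypothetical stand-in for stopwords.words("english") so the snippet runs without the NLTK data, and strip_stopwords is an illustrative name, not a library function.

```python
import re

# Hypothetical stand-in for stopwords.words("english"); the real list is much longer.
sw = ["i", "o", "ve", "in", "d", "the", "a", "is", "don", "don't"]

# Try longer alternatives first so that "don't" wins over "don".
alternatives = sorted(sw, key=len, reverse=True)

# \b anchors each alternative to word boundaries, so 'o' no longer matches inside 'love'.
# re.escape guards against stopwords that contain regex metacharacters.
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b",
    re.IGNORECASE,
)

def strip_stopwords(text):
    """Remove whole-word stopwords, then collapse the leftover whitespace."""
    without = pattern.sub("", text)
    return " ".join(without.split())

print(strip_stopwords("I love coding"))  # love coding
```

The re.IGNORECASE flag is what lets the capitalized 'I' match the lowercase stopword 'i'; punctuation is left untouched by this sketch.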

Reut Sharabani

Split the sentence into words first, then filter out the stop words:

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set gives fast membership tests
sentence = 'I love coding'
# lowercase so that 'I' matches the stopword 'i', then split on whitespace
words = [w for w in sentence.lower().split() if w not in stop]
print(words)
# ['love', 'coding']
print(" ".join(words))
# love coding
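
Note that lowercasing the whole sentence also lowercases the output, and a plain split() leaves punctuation attached to words. A rough sketch that compares in lowercase but keeps the original casing is shown here. It uses a simple regex tokenizer rather than a real tokenizer, and both the small stop set and the remove_stopwords name are assumptions for illustration (the set stands in for set(stopwords.words('english'))).

```python
import re

# Hypothetical subset standing in for set(stopwords.words('english')).
stop = {"i", "the", "a", "is", "and", "it"}

def remove_stopwords(sentence, stop_set=stop):
    """Return the non-stopword tokens of `sentence`, keeping their original case."""
    # Tokens are runs of letters, digits, or apostrophes; other punctuation is dropped.
    tokens = re.findall(r"[A-Za-z0-9']+", sentence)
    # Compare in lowercase, but return each token as it was written.
    return [t for t in tokens if t.lower() not in stop_set]

print(remove_stopwords("I love coding, and it is fun!"))
# ['love', 'coding', 'fun']
```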
Sundeep Pidugu