I'm trying to remove all the stop words from text input. The code below removes every stop word except those that begin a sentence.

How do I remove those words?

from nltk.stem.wordnet import WordNetLemmatizer

from nltk.corpus import stopwords
stopwords_nltk_en = set(stopwords.words('english'))

from string import punctuation
exclude_punctuation = set(punctuation)

stoplist_combined = set.union(stopwords_nltk_en, exclude_punctuation)

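def normalized_text(text):
   lemma = WordNetLemmatizer()
   # Drop every whitespace-separated token that is a stop word or a lone punctuation mark
   stopwords_punctuations_free = ' '.join([i for i in text.lower().split() if i not in stoplist_combined])
   # Lemmatize the remaining tokens and join them back into a single string
   normalized = ' '.join(lemma.lemmatize(word) for word in stopwords_punctuations_free.split())
   return normalized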


sentence = [['The birds are always in their house.'], ['In the hills the birds nest.']]

for item in sentence:
  print (normalized_text(str(item)))

OUTPUT: 
   the bird always house 
   in hill bird nest
Life is complex
  • Please don't clean your text in this manner, take a look at https://www.kaggle.com/alvations/basic-nlp-with-nltk . You're iterating through the text multiple times for no good reason. – alvas Oct 19 '18 at 08:39
  • I modified some of the code as you suggested. – Life is complex Oct 19 '18 at 16:12
  • You didn't read the full kernel ;P Lemmatization needs POS tags. – alvas Oct 19 '18 at 22:21
  • I did read the page and attempted the POS tag item, but I couldn't get the output to work as I wanted. The output was a list, but I wanted a string like my other code was outputting. Suggestions are welcome to get the same output. – Life is complex Oct 19 '18 at 22:40
  • https://stackoverflow.com/questions/12453580/concatenate-item-in-list-to-strings ;P – alvas Oct 20 '18 at 04:02
  • Thanks for the link. After more testing I noticed that the lemmatize piece was removing items that I needed to analyze. I plan to use my output in some type of text classification. – Life is complex Oct 20 '18 at 14:00

1 Answer

The culprit is this line of code:

print (normalized_text(str(item)))

If you try to print str(item) for the first element of your sentence list, you'll get:

['The birds are always in their house.']

which, once lowercased and split on whitespace, becomes:

["['the", 'birds', 'are', 'always', 'in', 'their', "house.']"]

As you can see, the first element is "['the", which does not match the stop word "the", so it is never removed.

Solution: use ''.join(item) instead of str(item) to convert item to a str.
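To make the difference concrete, here is a minimal sketch that needs nothing beyond the standard library. The variable names tokens_from_repr and tokens_from_join are only illustrative; they are not part of the question's code.

```python
item = ['The birds are always in their house.']

# str() on a list returns its repr, so the brackets and quotes become part of the text.
tokens_from_repr = str(item).lower().split()

# ''.join() concatenates the strings inside the list, so no list syntax leaks in.
tokens_from_join = ''.join(item).lower().split()

print(tokens_from_repr[0])  # first token still carries the bracket and quote
print(tokens_from_join[0])  # first token is the plain word
```

With the joined version, the first token is a plain lowercase word and can be matched against the stop list.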


Edit after comment

The text string can still contain single quotes ('). To handle them, pass the item to normalized_text directly:

for item in sentence:
    print (normalized_text(item))

Then import the regex module with import re and replace:

text.lower().split()

with:

re.split('\'| ', ''.join(text).lower())
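Putting the pieces together, the sketch below shows one way the tokenizing step might look. It rests on several assumptions: STOPLIST is a tiny hand-written stand-in for stopwords.words('english') so the example runs without NLTK data, remove_stopwords is a hypothetical name, lemmatization is left out, and stripping punctuation from each token is an extra step that the answer above does not include.

```python
import re
from string import punctuation

# Assumption: a small stand-in for the NLTK English stop word list.
STOPLIST = {'the', 'are', 'in', 'their'} | set(punctuation)

def remove_stopwords(text):
    # text may be a one-element list, as in the question; join it into a str first.
    tokens = re.split(r"'| ", ''.join(text).lower())
    # Extra step (not in the answer): strip surrounding punctuation so "house." becomes "house".
    cleaned = (tok.strip(punctuation) for tok in tokens)
    # Drop empty tokens and anything in the stop list.
    return ' '.join(tok for tok in cleaned if tok and tok not in STOPLIST)

sentences = [['The birds are always in their house.'], ['In the hills the birds nest.']]
for item in sentences:
    print(remove_stopwords(item))
```

In the real code, STOPLIST would be swapped for stoplist_combined from the question, and the lemmatizer call could be applied to the surviving tokens.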
Neb