Removing stopwords that begin a sentence with NLTK

Question

I'm attempting to remove all the stop words from text input. The code below removes all the stop words except those that begin a sentence.

How do I remove those words?

from nltk.stem.wordnet import WordNetLemmatizer

from nltk.corpus import stopwords
stopwords_nltk_en = set(stopwords.words('english'))

from string import punctuation
exclude_punctuation = set(punctuation)

stoplist_combined = set.union(stopwords_nltk_en, exclude_punctuation)

def normalized_text(text):
   lemma = WordNetLemmatizer()
   stopwords_punctuations_free = ' '.join([i for i in text.lower().split() if i not in stoplist_combined])
   normalized = ' '.join(lemma.lemmatize(word) for word in stopwords_punctuations_free.split())
   return normalized


sentence = [['The birds are always in their house.'], ['In the hills the birds nest.']]

for item in sentence:
  print (normalized_text(str(item)))

OUTPUT: 
   the bird always house 
   in hill bird nest

Please don't clean your text in this manner, take a look at https://www.kaggle.com/alvations/basic-nlp-with-nltk . You're iterating through the text multiple times for no good reason. — alvas, Oct 19 '18 at 08:39
You didn't read the full kernel ;P Lemmatization needs POS tags. — alvas, Oct 19 '18 at 22:21
I did read the page and did attempt the POS tag item, but I couldn't get the output to work as I wanted. The output was a list, but I wanted a string like the one my other code produces. Suggestions are welcome to get the same output. — Life is complex, Oct 19 '18 at 22:40
https://stackoverflow.com/questions/12453580/concatenate-item-in-list-to-strings ;P — alvas, Oct 20 '18 at 04:02
thanks for the link. After more testing I noted that the lemmatize piece was removing items that I needed to analyze. I plan to use my output in some type of text classification. — Life is complex, Oct 20 '18 at 14:00

Neb · Accepted Answer · 2018-10-19T16:25:44.827

The culprit is this line of code:

print (normalized_text(str(item)))

If you try to print str(item) for the first element of your sentence list, you'll get:

['The birds are always in their house.']

which, once lowercased and split, becomes:

["['the", 'birds', 'are', 'always', 'in', 'their', "house.']"]

As you can see, the first element is "['the", which does not match the stop word "the", so it is not removed.

Solution: use ''.join(item) instead of str(item) to convert item to a str.
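
A minimal sketch of the difference between the two conversions. The small stop_words set here is a hand-written stand-in (an assumption) for NLTK's English list, so the snippet runs without downloading any corpora:

```python
# Hypothetical stand-in for set(stopwords.words('english')); not the full NLTK list.
stop_words = {'the', 'are', 'in', 'their'}

item = ['The birds are always in their house.']

# str(item) keeps the list's brackets and quotes, so the first token is "['the".
broken_tokens = str(item).lower().split()

# ''.join(item) yields the plain sentence, so the first token is 'the'.
fixed_tokens = ''.join(item).lower().split()

broken = [t for t in broken_tokens if t not in stop_words]
fixed = [t for t in fixed_tokens if t not in stop_words]

print(broken)
print(fixed)
```

Note that the trailing period stays attached to the last token in both cases, because splitting on whitespace does not separate punctuation from words.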

Edit after comment

The text string may still contain single quotes ('). To handle them, call normalized_text on the item directly:

for item in sentence:
    print (normalized_text(item))

Then import the regex module with import re and replace:

text.lower().split()

with:

re.split('\'| ', ''.join(text).lower())
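
As a rough sketch of how that split behaves on a quoted line like the ones from the text file: the helper name tokens_without_stopwords and the small stop_words set are hypothetical, chosen only for illustration, and the pattern is the same one as above written as a raw string.

```python
import re

# Hypothetical stand-in for the NLTK English stop word set.
stop_words = {'the', 'are', 'in', 'their'}

def tokens_without_stopwords(text):
    # text may be a one-element list (as in the question) or a plain string;
    # ''.join() leaves a plain string unchanged.
    parts = re.split(r"'| ", ''.join(text).lower())
    # Splitting on a leading or trailing quote produces empty strings; drop them
    # along with the stop words.
    return [p for p in parts if p and p not in stop_words]

item = ["'In the hills the birds nest'"]
print(tokens_without_stopwords(item))
```

One thing to keep in mind: splitting on every apostrophe also breaks contractions such as "don't" into "don" and "t", so this may need adjusting if the text contains them.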

edited Oct 19 '18 at 16:25

answered Oct 18 '18 at 21:01


You are correct, this modification solved my issue in my example code, but I'm still having an issue in the production piece. – Life is complex Oct 19 '18 at 15:07
In the production code, I'm reading in a text file, which contains lines : 'The birds are always in their house', 'In the hills the birds nest', 'No birds are in their homes', etc..., – Life is complex Oct 19 '18 at 15:10
This is the output of a single line: ["'in", 'the', 'hills', 'the', 'birds', 'nest'"] using with open('summaries.txt', 'r') as input: lines = input.readlines() for line in lines: test = ''.join(line) print (test.lower().split()) – Life is complex Oct 19 '18 at 15:12
Could you elaborate more? I can't understand which problem you're facing now – Neb Oct 19 '18 at 15:45
look at the output of this: sentence = [["'The birds are always in their house'"], ["'In the hills the birds nest'"]] – Life is complex Oct 19 '18 at 16:08
Thanks, that worked. Is there a way to use both checks in my code? – Life is complex Oct 19 '18 at 17:46
Which checks do you refer to? – Neb Oct 19 '18 at 17:47
Disregard it's working on all the text items that I have tried. THANKS!! – Life is complex Oct 19 '18 at 18:00
