I am using NLTK to replace all stopwords with the string "QQQQQ". The problem is that if the input text (from which I remove the stopwords) contains more than one sentence, it doesn't work properly.
I have the following code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

ex_text = 'This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'
tokenized = word_tokenize(ex_text)
stop_words = set(stopwords.words('english'))
stop_words.add(".")  # Since I do not need punctuation, I added . and ,
stop_words.add(",")
# I need to note the position of all the stopwords for later use
stopword_pos = []
for w in tokenized:
    if w in stop_words:
        stopword_pos.append(tokenized.index(w))
# Replacing stopwords with "QQQQQ"
for i in range(len(stopword_pos)):
    tokenized[stopword_pos[i]] = 'QQQQQ'
print(tokenized)
That code gives the following output:
['This', 'QQQQQ', 'QQQQQ', 'example', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'special', 'keywords', 'QQQQQ', 'sum', 'QQQQQ', 'QQQQQ', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'Another', 'list', 'is', 'QQQQQ', 'QQQQQ', 'special', 'one', 'QQQQQ', 'I', 'like', 'very', 'much', '.']
As you might notice, it doesn't replace stopwords like 'is' and '.' in the second sentence (I added the full stop to the set, since I didn't want punctuation).
Keep in mind that the 'is' and '.' in the first sentence do get replaced; only the ones in the second sentence don't.
Another weird thing is that when I print stopword_pos, I get the following output:
[0, 1, 2, 5, 6, 7, 10, 12, 13, 15, 16, 17, 18, 19, 20, 1, 24, 25, 0, 29, 25, 20]
The numbers start out in ascending order, but then a '1' appears after '20' in the list that is supposed to hold the positions of the stopwords. There is also a '0' after '29' and a '20' after '25'. Perhaps that hints at what the problem is.
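To narrow this down, here is a minimal sketch of what might produce the repeated numbers, assuming they come from list.index(), which returns the position of the first matching element only. The token list is a short hand-written stand-in (hypothetical), not the real word_tokenize() output, so it runs without NLTK.

```python
# Hypothetical, hand-written tokens standing in for word_tokenize() output
tokens = ['this', 'is', 'a', 'list', '.', 'another', 'list', 'is', 'here', '.']

# list.index() always returns the position of the FIRST match, so the
# second 'is' and the second '.' report the positions of the earlier ones
positions = [tokens.index(w) for w in tokens if w in {'is', '.'}]
print(positions)
```

If that is what happens in my code, the second 'is' would be recorded at the position of the first one, which has already been replaced.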
So, the problem is that after the first sentence, the stopwords don't get replaced with 'QQQQQ'. Why is that?
Anything pointing me in the right direction is much appreciated. I don't have any clue how to solve the problem.
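One direction that might work is to record each token's own index with enumerate() instead of looking the word up again. This is only a sketch: the tokens and the small stop-word set below are hand-written assumptions so that it runs without NLTK, and I have not confirmed it against the full example.

```python
# Hypothetical tokens and a tiny hand-written stop-word set (not NLTK's list)
tokens = ['this', 'is', 'a', 'list', '.', 'another', 'list', 'is', 'here', '.']
stop_words = {'this', 'is', 'a', '.'}

# enumerate() yields (index, token) pairs, so every occurrence keeps
# its own position even when the same word appears more than once
stopword_pos = [i for i, w in enumerate(tokens) if w in stop_words]

# Build the replaced list without mutating the original tokens
replaced = ['QQQQQ' if w in stop_words else w for w in tokens]

print(stopword_pos)
print(replaced)
```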