
I am using NLTK to replace all stopwords with the string "QQQQQ". The problem is that if the input text (from which I remove the stopwords) contains more than one sentence, then it doesn't work properly.

I have the following code:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

ex_text='This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'

tokenized=word_tokenize(ex_text)

stop_words=set(stopwords.words('english'))
stop_words.add(".")  #Since I do not need punctuation, I added . and ,
stop_words.add(",")

stopword_pos=[]
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w in stop_words:
        stopword_pos.append(tokenized.index(w))

# Replacing stopwords with "QQQQQ"
for i in range(len(stopword_pos)):
    tokenized[stopword_pos[i]]='QQQQQ'  

print(tokenized)

That code gives the following output:

['This', 'QQQQQ', 'QQQQQ', 'example', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'special', 'keywords', 'QQQQQ', 'sum', 'QQQQQ', 'QQQQQ', 'list', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'QQQQQ', 'Another', 'list', 'is', 'QQQQQ', 'QQQQQ', 'special', 'one', 'QQQQQ', 'I', 'like', 'very', 'much', '.']

As you might notice, it doesn't replace stopwords like 'is' and '.' (I added the full stop to the set, since I didn't want punctuation).

Keep in mind that 'is' and '.' in the first sentence do get replaced, but the 'is' and '.' in the second sentence don't.

Another weird thing that happens is that when I print stopword_pos, I get the following output:

[0, 1, 2, 5, 6, 7, 10, 12, 13, 15, 16, 17, 18, 19, 20, 1, 24, 25, 0, 29, 25, 20]

As you might notice, the numbers seem to be arranged in ascending order, but suddenly, you have a '1' after '20' in the list that is supposed to hold the position of the stopwords. Also, you have '0' after '29' and '20' after '25'. Perhaps that might tell what the problem is.

So, the problem is that after the first sentence, the stopwords don't get replaced with 'QQQQQ's. Why is that?

Anything pointing me in the right direction is much appreciated. I don't have any clue how to solve the problem.

2 Answers


`tokenized.index(w)` gives you the index of the *first* occurrence of the item in the list, so repeated words always map back to their first position.

So, instead of recording indices at all, you can replace the stopwords directly with a list comprehension:

tokenized_new = [ word if word not in stop_words else 'QQQQQ' for word in tokenized ]
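To see why `.index` causes exactly the symptoms in the question, here is a minimal sketch with a made-up token list (no NLTK needed): the second 'is' is never found by `.index`, while `enumerate` finds every occurrence.

```python
tokens = ['This', 'is', 'a', 'test', '.', 'It', 'is', 'short', '.']

# .index() always returns the FIRST match, so the 'is' at position 6
# and the '.' at position 8 are never reported:
print(tokens.index('is'))  # -> 1 (both times the loop sees 'is')
print(tokens.index('.'))   # -> 4

# enumerate() yields (index, item) pairs, so every occurrence is found:
positions = [i for i, t in enumerate(tokens) if t == 'is']
print(positions)           # -> [1, 6]
```

This is the same fix the other answer applies with its `enumerate`-based loop.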
Van Peer
  • I thought it would be better to find the index using something other than `index()`. So, how do you find the index if there are multiple occurrences of an item in a list? – FrontEnd-Python Aug 04 '18 at 15:07
  • @FrontEnd-Python you can use `enumerate` as shown in student's answer to get that. – Van Peer Aug 04 '18 at 15:09

The problem is that `.index` returns only the index of the first match, not all the indices, so you need to collect every occurrence yourself, for example with `enumerate`:

stopword_pos_set = set() # creating set so that index is not added twice
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w.lower() in stop_words: 
        indices = [i for i, x in enumerate(tokenized) if x == w]
        stopword_pos_set.update(indices)

stopword_pos = list(stopword_pos_set) # convert to list

Above, I created `stopword_pos_set` so that the same index is not added twice. Replacement would still work with duplicates (the same position would just be overwritten with 'QQQQQ' twice), but if you print `stopword_pos` without using a set, you will see duplicate values.

One suggestion: in the code above, I changed the check to `if w.lower() in stop_words:` so that stopwords are matched case-insensitively; otherwise 'This' is not treated the same as 'this'.

Another suggestion is to use the `.update` method to add multiple items to the `stop_words` set at once, with `stop_words.update([".", ","])`, instead of calling `.add` multiple times.
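A minimal sketch of the difference (using a toy set, not the real NLTK stopword list):

```python
stop_words = {'a', 'the'}

# add() inserts exactly one element:
stop_words.add('.')

# update() takes an iterable and inserts each of its elements:
stop_words.update([',', '?'])

print(sorted(stop_words))  # -> [',', '.', '?', 'a', 'the']
```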


You can try it as below:

ex_text='This is an example list that has no special keywords to sum up the list, but it will do. Another list is a very special one this I like very much.'

tokenized = word_tokenize(ex_text)
stop_words = set(stopwords.words('english'))
stop_words.update([".", ","])  #Since I do not need punctuation, I added . and ,

stopword_pos_set = set()
# I need to note the position of all the stopwords for later use
for w in tokenized:
    if w.lower() in stop_words: 
        indices = [i for i, x in enumerate(tokenized) if x == w]
        stopword_pos_set.update(indices)

stopword_pos = sorted(list(stopword_pos_set)) # set to list

# Replacing stopwords with "QQQQQ"
for i in range(len(stopword_pos)):
    tokenized[stopword_pos[i]] = 'QQQQQ'  

print(tokenized)
print(stopword_pos)
niraj
  • I would expect the `stopword_pos` list to be in ascending order, but that isn't the case. Here's the output: [0, 1, 23, 2, 5, 6, 7, 10, 12, 13, 15, 16, 17, 18, 19, 20, 33, 1, 23, 24, 25, 31, 28, 29, 25, 31, 20, 33] – FrontEnd-Python Aug 04 '18 at 15:40
  • You can use `sorted` when converting to a list, as I updated above, i.e. `stopword_pos = sorted(list(stopword_pos_set))`. Also, I created a `set` and converted it to a `list` to avoid duplicate indices. – niraj Aug 04 '18 at 15:42
  • What is the difference between add and update? And why to use update? – FrontEnd-Python Aug 04 '18 at 15:48
  • If you want to add multiple items, `update` lets you add them all at once, while `add` adds a single item; you can also check https://stackoverflow.com/questions/28845284/add-vs-update-in-set-operations-in-python – niraj Aug 04 '18 at 15:50
  • Why use sorted? How would it sort it? And yeah, thanks a lot for the help. – FrontEnd-Python Aug 04 '18 at 16:03
  • So, if you want ascending order, you need to sort, since the indices are not necessarily found in order. Suppose you look for the index of the word `is`: it occurs at index 2 and again somewhere towards the end, say at index 23, so the set becomes {2, 23}. The next word may be at index 5, so the set grows to {2, 23, 5} in insertion order. At the end the collected indices are not ascending, and you use `sorted` to sort them. – niraj Aug 04 '18 at 16:10