
I would like to remove the stopwords from a list of lists while keeping the same format (i.e. a list of lists).

Following is the code that I have tried so far:

sent1 = 'I have a sentence which is a list'
sent2 = 'I have a sentence which is another list'

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

lst = [sent1, sent2]
sent_lower = [t.lower() for t in lst]

filtered_words=[]
for i in sent_lower:
    i_split = i.split()
    lst = []
    for j in i_split:
        if j not in stop_words:
            lst.append(j)
            " ".join(lst)
            filtered_words.append(lst)

Current Output of filtered_words:

filtered_words
[['sentence', 'list'],
 ['sentence', 'list'],
 ['sentence', 'another', 'list'],
 ['sentence', 'another', 'list'],
 ['sentence', 'another', 'list']]

Desired Output of filtered_words:

filtered_words
[['sentence', 'list'],
 ['sentence', 'another', 'list']]

I am getting duplicates of the lists. What might I be doing wrong in the loop? Also, is there a better way of doing this rather than writing so many for loops?


3 Answers

What you're doing wrong is appending lst to filtered_words each time you find a non-stopword. That's why you have 2 repetitions of the filtered sent1 (it contains 2 non-stopwords) and 3 repetitions of the filtered sent2 (it contains 3 non-stopwords). Just append after you've examined each sentence:

for i in sent_lower:
    i_split = i.split()
    lst = []
    for j in i_split:
        if j not in stop_words:
            lst.append(j)
    filtered_words.append(lst)

By the way, the statement

" ".join(lst)

is not useful, since you're computing something (a string) but not storing it anywhere.
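
If what you actually want back is the filtered sentences as strings rather than lists of words, the result of the join has to be stored somewhere. A minimal sketch, reusing sent_lower and stop_words from your code (filtered_sentences is just an illustrative name):

filtered_sentences = []
for i in sent_lower:
    lst = [j for j in i.split() if j not in stop_words]
    filtered_sentences.append(" ".join(lst))  # keep the joined string
# filtered_sentences == ['sentence list', 'sentence another list']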

EDIT

A more Pythonic way to do this is with a list comprehension:

for s in sent_lower:
    lst = [j for j in s.split() if j not in stop_words]
    filtered_words.append(lst)
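
If you like, the outer loop can be folded in as well, so the whole thing becomes a single nested list comprehension (same variable names as above):

filtered_words = [[j for j in s.split() if j not in stop_words] for s in sent_lower]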

You can use itertools once you have the duplicated result in filtered_words -

import itertools
filtered_words.sort()
list(filtered_words for filtered_words,_ in itertools.groupby(filtered_words))

The output comes out to be -

[['sentence', 'another', 'list'], ['sentence', 'list']]

I followed a link on Stack Overflow - Remove duplicates from a list of lists
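
If you would rather drop the duplicates without sorting (so the sentences keep their original order), one possible sketch is to track the lists you have already seen as tuples in a set:

seen = set()
deduped = []
for words in filtered_words:
    key = tuple(words)  # lists aren't hashable, so use a tuple as the set key
    if key not in seen:
        seen.add(key)
        deduped.append(words)
# deduped keeps the first occurrence of each sentence, in the order they appear in filtered_words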

This will give you the desired result:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

sent1 = 'I have a sentence which is a list'
sent2 = 'I have a sentence which is another list'

sent1 = sent1.lower().split()
sent2 = sent2.lower().split()

l = [sent1, sent2]

for n, sent in enumerate(l):
    for stop_word in stop_words:
        sent = [word for word in sent if word != stop_word]
    l[n] = sent

print(l)
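
One small note that applies to all of the approaches above: stopwords.words('english') returns a plain list, so every membership check against stop_words scans the whole list. Wrapping it in a set makes the lookups effectively constant-time, which matters on larger corpora. A minimal self-contained sketch:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # set lookups are O(1), list lookups are O(n)

sent1 = 'I have a sentence which is a list'
sent2 = 'I have a sentence which is another list'

filtered = [[w for w in s.lower().split() if w not in stop_words]
            for s in (sent1, sent2)]
print(filtered)  # [['sentence', 'list'], ['sentence', 'another', 'list']]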