
I would like to remove the stopwords from a list of lists while keeping the same format (i.e. a list of lists).

Following is the code that I have tried so far:

sent1 = 'I have a sentence which is a list'
sent2 = 'I have a sentence which is another list'

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

lst = [sent1, sent2]
sent_lower = [t.lower() for t in lst]

filtered_words=[]
for i in sent_lower:
    i_split = i.split()
    lst = []
    for j in i_split:
        if j not in stop_words:
            lst.append(j)
            " ".join(lst)
            filtered_words.append(lst)

Current Output of filtered_words:

filtered_words
[['sentence', 'list'],
 ['sentence', 'list'],
 ['sentence', 'another', 'list'],
 ['sentence', 'another', 'list'],
 ['sentence', 'another', 'list']]

Desired Output of filtered_words:

filtered_words
[['sentence', 'list'],
 ['sentence', 'another', 'list']]

I am getting duplicates of the lists. What might I be doing wrong in the loop? Also, is there a better way of doing this rather than writing so many for loops?


3 Answers

What you're doing wrong is appending lst to filtered_words each time you find a non-stopword. That's why you have 2 repetitions of the filtered sent1 (it contains 2 non-stopwords) and 3 repetitions of the filtered sent2 (it contains 3 non-stopwords). Just append after you've examined each sentence:

for i in sent_lower:
    i_split = i.split()
    lst = []
    for j in i_split:
        if j not in stop_words:
            lst.append(j)
    filtered_words.append(lst)

By the way, the statement

" ".join(lst)

is not useful, since you're computing something (a string) but not storing it anywhere.
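
If what you actually want back is the filtered sentences as strings rather than lists of words, the result of the join has to be stored somewhere. A minimal sketch, reusing sent_lower and stop_words from your code (filtered_sentences is just an illustrative name):

filtered_sentences = []
for i in sent_lower:
    lst = [j for j in i.split() if j not in stop_words]
    filtered_sentences.append(" ".join(lst))  # keep the joined string
# filtered_sentences == ['sentence list', 'sentence another list']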

EDIT

A more Pythonic way to do this is with a list comprehension:

for s in sent_lower:
    lst = [j for j in s.split() if j not in stop_words]
    filtered_words.append(lst)
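
If you like, the outer loop can be folded in as well, so the whole thing becomes a single nested list comprehension (same variable names as above):

filtered_words = [[j for j in s.split() if j not in stop_words] for s in sent_lower]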

You can use itertools once you have the duplicated result in filtered_words -

import itertools
filtered_words.sort()
list(filtered_words for filtered_words,_ in itertools.groupby(filtered_words))

The output comes out to be -

[['sentence', 'another', 'list'], ['sentence', 'list']]

I followed a link on Stack Overflow - Remove duplicates from a list of lists
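
If you would rather drop the duplicates without sorting (so the sentences keep their original order), one possible sketch is to track the lists you have already seen as tuples in a set:

seen = set()
deduped = []
for words in filtered_words:
    key = tuple(words)  # lists aren't hashable, so use a tuple as the set key
    if key not in seen:
        seen.add(key)
        deduped.append(words)
# deduped keeps the first occurrence of each sentence, in the order they appear in filtered_words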

This will give you the desired result:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

sent1 = 'I have a sentence which is a list'
sent2 = 'I have a sentence which is another list'

sent1 = sent1.lower().split()
sent2 = sent2.lower().split()

l = [sent1, sent2]

for n, sent in enumerate(l):
    for stop_word in stop_words:
        sent = [word for word in sent if word != stop_word]
    l[n] = sent

print(l)
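
One small note that applies to all of the approaches above: stopwords.words('english') returns a plain list, so every membership check against stop_words scans the whole list. Wrapping it in a set makes the lookups effectively constant-time, which matters on larger corpora. A minimal self-contained sketch:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # set lookups are O(1), list lookups are O(n)

sent1 = 'I have a sentence which is a list'
sent2 = 'I have a sentence which is another list'

filtered = [[w for w in s.lower().split() if w not in stop_words]
            for s in (sent1, sent2)]
print(filtered)  # [['sentence', 'list'], ['sentence', 'another', 'list']]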