Please excuse my noobness. I have a list of lists:
print(tokens)
[['What', "'s", 'my', 'name', '?'], ['My', 'name', 'is', 'Aditya', '.'], ['My', 'name', 'is', 'Glen'],
['My', 'name', 'is', 'Kenta', '.'], ['My', 'name', 'is', 'Keita'], ['My', 'name', 'is', 'Ganchan'],
['My', 'name', 'is', 'Anna', '.'], ['My', 'name', 'is', 'Tho'], ['My', 'name', 'is', 'Joe', '.']]
What I am trying to do is remove all of the stop words given in the default stop-words corpus of the Python NLTK library, which I have downloaded and imported:
stop_words = set(stopwords.words('english'))
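As a side note on the casing: as far as I know, the NLTK English stop-word list is all lowercase, which is why I compare uppercased forms on both sides. A quick sanity check (with a hardcoded stand-in set, so it runs without the corpus download):

```python
# Hardcoded stand-in for a few entries from NLTK's English stop-word list,
# so this check runs without the corpus download.
sample_stop_words = {'what', 'is', 'my', 'the'}

# Uppercasing both sides makes the membership test case-insensitive.
print('My'.upper() in (w.upper() for w in sample_stop_words))      # True
print('Aditya'.upper() in (w.upper() for w in sample_stop_words))  # False
```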
For this, I used a nested for loop to iterate over the inner lists and match each token against the stop words. However, when I try to wrap the results back up into a nested list, it only keeps the last list.
The code:
filtered_tokens = []
filtered_tokens_list = []
for token in tokens:
    filtered_tokens.clear()
    for t in token:
        if t.upper() not in (name.upper() for name in stop_words):
            filtered_tokens.append(t)
    filtered_tokens_list.append(filtered_tokens)
filtered_tokens_list
The output:
[['name', 'Joe', '.'],
['name', 'Joe', '.'],
['name', 'Joe', '.'],
['name', 'Joe', '.'],
['name', 'Joe', '.'],
['name', 'Joe', '.'],
['name', 'Joe', '.'],
['name', 'Joe', '.'],
['name', 'Joe', '.']]
I tried to see how filtered_tokens_list looks at each iteration by printing it out:
for token in tokens:
    filtered_tokens.clear()
    for t in token:
        if t.upper() not in (name.upper() for name in stop_words):
            filtered_tokens.append(t)
    filtered_tokens_list.append(filtered_tokens)
    print(filtered_tokens_list)
And the output is:
[["'s", 'name', '?']]
[['name', 'Aditya', '.'], ['name', 'Aditya', '.']]
[['name', 'Glen'], ['name', 'Glen'], ['name', 'Glen']]
[['name', 'Kenta', '.'], ['name', 'Kenta', '.'], ['name', 'Kenta', '.'], ['name', 'Kenta', '.']]
[['name', 'Keita'], ['name', 'Keita'], ['name', 'Keita'], ['name', 'Keita'], ['name', 'Keita']]
[['name', 'Ganchan'], ['name', 'Ganchan'], ['name', 'Ganchan'], ['name', 'Ganchan'], ['name', 'Ganchan'], ['name', 'Ganchan']]
[['name', 'Anna', '.'], ['name', 'Anna', '.'], ['name', 'Anna', '.'], ['name', 'Anna', '.'], ['name', 'Anna', '.'], ['name', 'Anna', '.'], ['name', 'Anna', '.']]
[['name', 'Tho'], ['name', 'Tho'], ['name', 'Tho'], ['name', 'Tho'], ['name', 'Tho'], ['name', 'Tho'], ['name', 'Tho'], ['name', 'Tho']]
[['name', 'Joe', '.'], ['name', 'Joe', '.'], ['name', 'Joe', '.'], ['name', 'Joe', '.'], ['name', 'Joe', '.'], ['name', 'Joe', '.'], ['name', 'Joe', '.'], ['name', 'Joe', '.'], ['name', 'Joe', '.']]
For some reason, every entry in the list is getting overwritten by the latest contents of filtered_tokens.
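I boiled it down to a tiny example with plain lists (no NLTK involved), and the same thing happens, so I suspect it has something to do with appending and clearing:

```python
# Appending the same list object twice: both entries in `outer`
# end up pointing at the same underlying list.
inner = [1, 2]
outer = []
outer.append(inner)
inner.clear()        # this also empties outer[0]
inner.append(3)
outer.append(inner)
print(outer)                 # [[3], [3]]
print(outer[0] is outer[1])  # True
```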
The output I am looking for is:
[["'s", 'name', '?'],['name', 'Aditya', '.'],['name', 'Glen'],['name', 'Kenta', '.'],['name', 'Keita'],
['name', 'Ganchan'],['name', 'Anna', '.'],['name', 'Tho'],['name', 'Joe', '.']]
It's quite baffling and I haven't seen anything like this online. Would really appreciate the help!