I am implementing a simple doc2vec with gensim (not a word2vec).
I need to remove stopwords from a list of lists without losing the correct order.
Each sublist is a document and, as I understand it for doc2vec, the model takes a list of TaggedDocuments as input:
model = Doc2Vec(lst_tag_documents, vector_size=5, window=2, min_count=1, workers=4)
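For context, lst_tag_documents is meant to be built from the filtered dataset shown below; this is a minimal sketch of what I intend, assuming integer document indices are acceptable as tags:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Sketch: each filtered document becomes one TaggedDocument whose
# `words` field is a list of tokens and whose tag is the document index.
# (lst_lst_filtered is the filtered dataset built further down.)
lst_tag_documents = [
    TaggedDocument(words=doc, tags=[i])
    for i, doc in enumerate(lst_lst_filtered)
]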
dataset = [['We should remove the stopwords from this example'],
           ['Otherwise the algo'],
           ["will not work correctly"],
           ['dont forget Gensim doc2vec takes list_of_list']]

STOPWORDS = ['we', 'i', 'will', 'the', 'this', 'from']

def word_filter(lst):
    lower = [word.lower() for word in lst]
    lst_ftred = [word for word in lower if word not in STOPWORDS]
    return lst_ftred

lst_lst_filtered = list(map(word_filter, dataset))
print(lst_lst_filtered)
Output:
[['we should remove the stopwords from this example'], ['otherwise the algo'], ['will not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]
Expected Output:
[[' should remove the stopwords example'], ['otherwise the algo'], [' not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]
What was my mistake, and how can I fix it?
Are there other efficient ways to solve this issue without losing the proper order?
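For reference, the kind of token-level fix I have in mind would look roughly like the sketch below; it assumes simple whitespace tokenization via split() is acceptable:

STOPWORDS = {'we', 'i', 'will', 'the', 'this', 'from'}

def word_filter_tokens(doc):
    # Split the one string into tokens, lowercase them, and keep only
    # non-stopwords; the comprehension preserves the original word order.
    return [w for w in doc.lower().split() if w not in STOPWORDS]

# Each sublist holds a single string, so filter that string per document.
lst_lst_filtered = [word_filter_tokens(sub[0]) for sub in dataset]

This yields a list of tokens per document rather than one string, which is also the shape TaggedDocument expects for its words argument.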
List of questions I examined before asking:
How to apply a function to each sublist of a list in python?
- I studied this and tried to apply it to my specific case
Removing stopwords from list of lists
- The order is important, so I can't use a set
Removing stopwords from a list of text files
- This could be a possible solution; it is similar to what I have implemented.
- I understood the difference, but I don't know how to deal with it. In my case the document is not tokenized (and, as I understand it, should not be tokenized, because this is doc2vec, not word2vec)
How to remove stop words using nltk or python
- In this question the OP is dealing with a flat list, not a list of lists (see the nltk sketch below)
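For that nltk route, what I had in mind is something like the sketch below; it assumes the stopwords corpus has already been fetched with nltk.download('stopwords'):

from nltk.corpus import stopwords

# Sketch: use nltk's English stopword list instead of a hand-written one.
# Iterating the tokens left to right preserves the original order.
STOP_EN = set(stopwords.words('english'))

def nltk_filter(doc):
    return [w for w in doc.lower().split() if w not in STOP_EN]

filtered = [nltk_filter(sub[0]) for sub in dataset]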