
I am implementing a simple doc2vec model with gensim, not word2vec.

I need to remove stopwords from a list of lists without losing the correct order.

Each inner list is a document and, as I understand it, for doc2vec the model takes a list of TaggedDocuments as input:

model = Doc2Vec(lst_tag_documents, vector_size=5, window=2, min_count=1, workers=4)
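
where `lst_tag_documents` is built roughly like this (a sketch, not my exact code; `lst_lst_filtered` is the filtered list produced below):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# each filtered document becomes one TaggedDocument, tagged with its position in the dataset
lst_tag_documents = [TaggedDocument(words=doc, tags=[i])
                     for i, doc in enumerate(lst_lst_filtered)]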

dataset = [['We should remove the stopwords from this example'],
     ['Otherwise the algo'],
     ["will not work correctly"],
     ['dont forget Gensim doc2vec takes list_of_list' ]]

STOPWORDS = ['we','i','will','the','this','from']


def word_filter(lst):
  lower=[word.lower() for word in lst]
  lst_ftred = [word for word in lower if not word in STOPWORDS]
  return lst_ftred

lst_lst_filtered= list(map(word_filter,dataset))
print(lst_lst_filtered)

Output:

[['we should remove the stopwords from this example'], ['otherwise the algo'], ['will not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]

Expected Output:

[[' should remove the stopwords   example'], ['otherwise the algo'], [' not work correctly'], ['dont forget gensim doc2vec takes list_of_list']]

  • What was my mistake, and how do I fix it?

  • Are there other efficient ways to solve this issue without losing the proper order?


List of questions I examined before asking:

How to apply a function to each sublist of a list in python?

  • I studied this and tried to apply it to my specific case

Removing stopwords from list of lists

  • The order is important, so I can't use a set

Removing stopwords from a list of text files

  • This could be a possible solution; it is similar to what I have implemented.
  • I understood the difference, but I don't know how to deal with it. In my case the document is not tokenized (and should not be tokenized, because it is doc2vec, not word2vec)

How to remove stop words using nltk or python

  • In this question they are dealing with a list, not a list of lists
Andrea Ciufo

2 Answers


`lower` is a list with a single element (the whole sentence as one lowercased string), so `word in STOPWORDS` is always False and nothing gets filtered out. Take the first item of the list by index and split it on whitespace:

lst_ftred = ' '.join([word for word in lower[0].split() if word not in STOPWORDS])
# output: ['should remove stopwords example', 'otherwise algo', 'not work correctly', 'dont forget gensim doc2vec takes list_of_list']
# 'the' is also in STOPWORDS
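
Plugged back into the question's function, it would look roughly like this (a sketch that keeps the question's `dataset` and `STOPWORDS`; returning the joined string is just one option, you could also return the token list instead):

def word_filter(lst):
    # lst holds a single string: index it, lowercase it, then split into tokens
    tokens = lst[0].lower().split()
    return ' '.join([word for word in tokens if word not in STOPWORDS])

lst_filtered = list(map(word_filter, dataset))
print(lst_filtered)
# ['should remove stopwords example', 'otherwise algo', 'not work correctly',
#  'dont forget gensim doc2vec takes list_of_list']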
Guy

First, note it's not that important to remove stopwords from Doc2Vec training. Second, note that such tiny toy datasets won't deliver interesting results from Doc2Vec. The algorithm, like Word2Vec, only starts to show its value when trained on large datasets with (1) many, many more unique words than the number of vector dimensions; and (2) lots of varied examples of the usage of each word - at least a few, ideally dozens or hundreds.

Still, if you wanted to strip stopwords, it'd be easiest if you did it after tokenizing the raw strings. (That is, splitting the strings into lists-of-words. That's the format Doc2Vec will need anyway.) And, you don't want your dataset to be a list-of-lists-with-one-string-each. Instead, you want it to be either a list-of-strings (at first), then a list-of-lists-with-many-tokens-each.

The following should work:

string_dataset = [
     'We should remove the stopwords from this example',
     'Otherwise the algo',
     "will not work correctly",
     'dont forget Gensim doc2vec takes list_of_list',
]

STOPWORDS = ['we','i','will','the','this','from']

# Python list comprehension to break into tokens
tokenized_dataset = [s.split() for s in string_dataset]

def filter_words(tokens):
    """lowercase each token, and keep only if not in STOPWORDS"""
    return [token.lower() for token in tokens if token.lower() not in STOPWORDS]

filtered_dataset = [filter_words(sent) for sent in tokenized_dataset]

Finally, as noted above, because Doc2Vec needs multiple usage examples of each word to work well, it's almost always a bad idea to use min_count=1.
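
If it helps, here is a minimal sketch of the remaining step, wrapping the filtered documents in `TaggedDocument` objects and training (the parameter values are placeholders that only make sense for this toy data):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one TaggedDocument per filtered document, tagged with its index
tagged_docs = [TaggedDocument(words=tokens, tags=[i])
               for i, tokens in enumerate(filtered_dataset)]

# min_count=1 and a tiny vector_size are only tolerable for a toy example
model = Doc2Vec(tagged_docs, vector_size=5, window=2, min_count=1, workers=4, epochs=20)

# inferring a vector for a new (tokenized) document
vector = model.infer_vector(['remove', 'stopwords'])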

gojomo
  • Tnx, yes I tried to create a Minimal Reproducible Example, but the dataset is definitely bigger than 100k documents. What was not clear to me from the documentation is that also in doc2vec you have to tokenize all the strings that constitute the document. – Andrea Ciufo Apr 29 '21 at 06:57
  • The docs for the `TaggedDocument` class (https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument) – which is the recommended type for `Doc2Vec` training examples – describe its first argument, `words`, as 'a list of unicode string tokens', and the intro examples show preprocessing strings into lists-of-tokens. Was there some part of the docs or examples that indicated to you otherwise? – gojomo Apr 29 '21 at 19:01
  • I am sure I wrongly interpreted an unofficial tutorial that I can't find anymore (that's why I think I made a mistake), because before asking this question I was wondering why I was working with a list of lists. BTW I will study again (and better) all the official tutorials – Andrea Ciufo May 02 '21 at 15:14