58

How can I get document vectors for two text documents using Doc2Vec? I am new to this, so it would be helpful if someone could point me in the right direction or to a tutorial.

I am using gensim.

doc1=["This is a sentence","This is another sentence"]
documents1=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents1, size = 100, window = 300, min_count = 10, workers=4)

I get

AttributeError: 'list' object has no attribute 'words'

whenever I run this.

petezurich
bee2502

4 Answers

45

If you want to train a Doc2Vec model, your data set needs to contain lists of words (similar to the Word2Vec format) and tags (document ids). It can also contain additional info (see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb for more information).

# Import libraries

from gensim.models import doc2vec
from collections import namedtuple

# Load data

doc1 = ["This is a sentence", "This is another sentence"]

# Transform data (you can add more data preprocessing steps) 

docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))

# Train model (set min_count = 1, if you want the model to work with the provided example data set)

model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)

# Get the vectors

model.docvecs[0]
model.docvecs[1]
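
To see why the plain lists from the question fail while the namedtuple version works, here is a minimal stdlib-only sketch. The `read_corpus` helper is hypothetical: it only mimics the attribute access (`.words` and `.tags`) that the error message points to, and it is not gensim code.

```python
from collections import namedtuple

# Hypothetical stand-in for the attribute access a Doc2Vec-style trainer
# performs on each document. This is NOT gensim code.
def read_corpus(documents):
    return [(doc.words, doc.tags) for doc in documents]

AnalyzedDocument = namedtuple("AnalyzedDocument", "words tags")

raw = ["This is a sentence", "This is another sentence"]

# Plain token lists (the question's input) have no .words attribute.
plain_lists = [text.split() for text in raw]
try:
    read_corpus(plain_lists)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Wrapping each token list together with a list of tags gives the expected shape.
tagged = [AnalyzedDocument(text.lower().split(), [i]) for i, text in enumerate(raw)]
pairs = read_corpus(tagged)

print(error_message)
print(pairs)
```

Any object exposing `words` and `tags` would pass through this helper the same way, which is why a namedtuple is enough here.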

UPDATE (how to train in epochs): This example became outdated, so I deleted it. For more information on training in epochs, see this answer or @gojomo's comment.

Lenka Vraná
  • I really like the idea of using `namedtuple` here, but what confuses me is: what if there was a doc2? It looks like `tags` is an `id` for the sentence and not for the document. The `docs` list makes it seem like there can be more than one doc in there. – O.rka Jul 30 '17 at 16:17
  • There are in fact 2 different documents in doc1 (not two sentences in one document). I don't know why @bee2502 named this `doc1`. However, you can guess this from the line `documents1=[doc.strip().split(" ") for doc in doc1 ]` – Lenka Vraná Aug 02 '17 at 13:10
  • @LenkaVraná Many thanks for the great answer :) Do we have to train our doc2vec model for several epochs? If so, how can we do that with the above example? –  Oct 18 '17 at 08:15
  • @MMM Yes, I would recommend to do so. I have updated the answer. Enjoy! – Lenka Vraná Oct 19 '17 at 15:40
  • @LenkaVraná Thanks a lot :) –  Oct 20 '17 at 10:04
  • @LenkaVraná I am following your code to get my document vectors. The only change I did was assigning a string doc tag. However, I encountered the following issue while doing so. https://stackoverflow.com/questions/47332205/issues-in-doc2vec-tags-in-gensim (My changed code is also here) Can you please tell me how to overcome this issue? –  Nov 17 '17 at 02:32
  • 1
    @Volka: The tag is a list (a list of integers in this case, a list of strings in your case, but always a list). – Lenka Vraná Nov 17 '17 at 23:59
  • 1
    Nearly everyone should **not** try to manage `alpha` on their own, and **not** call `train()` multiple times in their own loop. Instead, call `train()` once with the desired `epochs` argument. It will smoothly manage learning-rate `alpha` from its starting value, to its final value, across all the repeated passes over the data. – gojomo Apr 25 '19 at 01:35
  • @LenkaVraná a question, should not the vector representation have the same length specified with `size = 100` in the training part? Thanks :) – Valerio Ficcadenti Dec 13 '22 at 12:21
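
As a rough illustration of gojomo's point about `alpha`, the sketch below computes a learning rate that falls from a starting value to a final value over a fixed number of passes. It assumes a linear decay applied once per epoch, and the values 0.025 and 0.0001 are illustrative assumptions. The function name `linear_alpha_schedule` is made up for this example, and none of this is gensim code.

```python
def linear_alpha_schedule(start_alpha, end_alpha, epochs):
    """Return one learning rate per epoch, decaying linearly from start to end."""
    if epochs == 1:
        return [start_alpha]
    step = (start_alpha - end_alpha) / (epochs - 1)
    return [start_alpha - step * epoch for epoch in range(epochs)]

# Illustrative values only (assumptions, not read from any trained model).
schedule = linear_alpha_schedule(0.025, 0.0001, epochs=5)

for epoch, alpha in enumerate(schedule):
    print(epoch, round(alpha, 6))
```

A hand-written loop that calls `train()` repeatedly has to reproduce this kind of schedule itself, which is easy to get wrong. Passing `epochs` to a single `train()` call leaves the bookkeeping to the library.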
36

Gensim was updated. LabeledSentence no longer takes labels; it now takes tags. See the documentation for LabeledSentence: https://radimrehurek.com/gensim/models/doc2vec.html

However, @bee2502 was right with

docvec = model.docvecs[99] 

It will return the 100th vector's value for the trained model. It works with integers and strings.

l.augustyniak
27
doc1=["This is a sentence","This is another sentence"]
documents=[doc.strip().split(" ") for doc in doc1 ]
model = doc2vec.Doc2Vec(documents, size = 100, window = 300, min_count = 10, workers=4)

I got AttributeError: 'list' object has no attribute 'words' because the input documents passed to Doc2Vec() were not in the correct LabeledSentence format. I hope the example below will help you understand the format.

documents = LabeledSentence(words=[u'some', u'words', u'here'], labels=[u'SENT_1']) 

More details are here: http://rare-technologies.com/doc2vec-tutorial/ However, I solved the problem by reading the input data from a file using TaggedLineDocument().
File format: one document = one line = one TaggedDocument object. Words are expected to be already preprocessed and separated by whitespace; tags are constructed automatically from the document line number.

sentences=doc2vec.TaggedLineDocument(file_path)
model = doc2vec.Doc2Vec(sentences,size = 100, window = 300, min_count = 10, workers=4)
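
The file format described above can be sketched without gensim. This is only a stdlib approximation of the behaviour as described (one line per document, whitespace-separated words, line number as the tag). It is not the library's implementation, and `TaggedLine` and `iter_tagged_lines` are hypothetical names.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for a tagged document: a word list plus a list of tags.
TaggedLine = namedtuple("TaggedLine", "words tags")

def iter_tagged_lines(path):
    """Yield one tagged document per line, tagged with its 0-based line number."""
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle):
            yield TaggedLine(line.split(), [line_no])

# Write a tiny two-document corpus to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("this is a sentence\nthis is another sentence\n")
    path = tmp.name

docs = list(iter_tagged_lines(path))
os.remove(path)

for doc in docs:
    print(doc.tags, doc.words)
```

With tags built this way, the vector for the n-th line of the file is looked up with the integer n - 1.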

To get a document vector, use docvecs. More details here: https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.TaggedDocument

docvec = model.docvecs[99] 

where 99 is the id of the document whose vector we want. If the labels are integers (the default when you load with TaggedLineDocument()), use the integer id directly, as I did. If the labels are strings, use "SENT_99" instead. This is similar to Word2vec.

bee2502
  • 1
    Just to confirm, after training model_dm and model_dbow as shown in tutorial (https://linanqiu.github.io/2015/05/20/word2vec-sentiment/) I am getting the document vector back for the first training document using model_dm.docvecs['TRAIN_0']. Is this correct? – exAres Oct 12 '15 at 10:43
  • yes that is correct, and you could then compare several documents with a distance function etc. – Luke Barker Aug 17 '16 at 12:57
  • 2
    I have more than 5M training documents, but when I use docvec = model.docvecs[11], I get an error saying that index 11 is out of bounds for axis 0 with size 10. I checked the docvecs size and it is only 10, when it should be more than 5M. – Kun Sep 20 '16 at 21:35
  • 1
    @Kun Old topic but I had the same issue. Solution is to pass a list when creating a TaggedDocument. For example TaggedDocument(words, ["label_1"]) otherwise it will take every letter as a label. – user667804 Feb 14 '17 at 15:47
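
The "every letter as a label" effect from the last comment can be reproduced in plain Python. The sketch below assumes the tags were numeric ids stored as strings, which is a guess about the setup in that comment and not something stated there. Under that assumption, iterating over each string yields single digits, so only 10 distinct tags survive.

```python
# Assumed setup: document ids stored as strings such as "0", "1", ..., "999".
doc_ids = [str(i) for i in range(1000)]

# Passing the bare string as tags: iterating a string yields its characters.
tags_from_strings = {tag for doc_id in doc_ids for tag in doc_id}

# Passing a one-element list as tags: iterating yields the whole id.
tags_from_lists = {tag for doc_id in doc_ids for tag in [doc_id]}

print(len(tags_from_strings), sorted(tags_from_strings))
print(len(tags_from_lists))
```

Wrapping the id in a list, as in TaggedDocument(words, ["label_1"]), keeps one tag per document.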
0
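from gensim.models.doc2vec import Doc2Vec, TaggedDocument
doc1 = ["This is a sentence", "This is another sentence"]
# Each document needs a list of words and a list of tags
documents = [TaggedDocument(doc.split(), [i]) for i, doc in enumerate(doc1)]
# min_count=1 so the tiny example keeps every word; add other parameters as needed
model = Doc2Vec(documents, min_count=1)

This should work fine. You need to tag your documents to train a Doc2Vec model.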

Til