Using gensim's Doc2Vec to produce sentence vectors

Question

I'm trying to use Doc2Vec to read in a file that is a list of sentences like this:

The elephant flaps its large ears to cool the blood in them and its body.

A house is a permanent building or structure for people or families to live in.

...

What I want to do is generate two files: one with the unique words from these sentences, and another with one corresponding vector per line (if a sentence has no vector, I want to output a vector of 0's).

I'm getting the vocab fine with my code, but I can't figure out how to print the individual sentence vectors. I have looked through the documentation and haven't found much help. Here is what my code looks like so far:

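import re
from gensim.models.doc2vec import Doc2Vec, LabeledSentence

# Give every line of the input file its own label: SENT_0, SENT_1, ...
sentences = []
for uid, line in enumerate(open(filename)):
    sentences.append(LabeledSentence(words=line.split(), labels=['SENT_%s' % uid]))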

model = Doc2Vec(alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)
for epoch in range(10):
    model.train(sentences)
    model.alpha -= 0.002
    model.min_alpha = model.alpha
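# Print the vocabulary, skipping the sentence labels (SENT_0, SENT_1, ...).
# Note: r'[SENT].*' is a character class and would also skip any word
# containing an uppercase S, E, N or T.
sent_reg = r'SENT_\d+$'
for item in model.vocab.keys():
    if re.match(sent_reg, item):
        continue
    print item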

### I'm not sure how to produce the vectors from here, and this doesn't work
sent_id = 0
for item in model:
    print model["SENT_"+str(sent_id)]
    sent_id += 1

Have you tried setting min_count = 1? Doc2Vec(min_count = 1) — slizb, Aug 20 '15 at 13:16

score 3 · Answer 1 · answered Aug 31 '15 at 21:29


With the latest gensim (0.12.1) you could try:

print model.docvecs["SENT_"+str(sent_id)]
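To get the one-vector-per-line file with a zero fallback that the question asks for, something like the following might work. It is only a sketch: `sentence_vector_lines` is a hypothetical helper, the plain dict `fake_docvecs` stands in for `model.docvecs`, and the code assumes a missing tag raises `KeyError` the way a dict lookup does.

```python
import os
import tempfile


def sentence_vector_lines(docvecs, num_sentences, vector_size):
    """Return one space-separated vector per sentence, zeros if the tag is missing."""
    zero = [0.0] * vector_size
    lines = []
    for sent_id in range(num_sentences):
        tag = "SENT_" + str(sent_id)
        try:
            vec = docvecs[tag]
        except KeyError:  # assumption: unknown tags raise KeyError
            vec = zero
        lines.append(" ".join(str(float(x)) for x in vec))
    return lines


# Stand-in for model.docvecs (a plain dict, purely for illustration).
fake_docvecs = {"SENT_0": [0.1, 0.2, 0.3], "SENT_2": [0.4, 0.5, 0.6]}
lines = sentence_vector_lines(fake_docvecs, num_sentences=3, vector_size=3)

# Write the vectors out, one per line (hypothetical output path).
out_path = os.path.join(tempfile.gettempdir(), "sentence_vectors.txt")
with open(out_path, "w") as out:
    out.write("\n".join(lines) + "\n")
```

With a trained model you would pass `model.docvecs` in place of `fake_docvecs` and the model's vector size in place of `3`.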


Nicholas


Yes, the document-vectors are now in a `model.docvecs` sub-property. – gojomo Oct 01 '15 at 00:21
@gojomo @Nicholas why is it that `model.docvecs` doesn't necessarily have the same number of rows as I have sentences? – Katya Willard May 12 '16 at 20:32
The size of `docvecs` will be the number of unique doctags seen during the initial scan-of-examples, with the added caveat that if you use plain-ints as doctags, it will allocate space for all ints up to the largest. So if you provide 10 examples, but there are only 2 unique string doctags repeated, there will only be 2 rows in `docvecs.doctag_syn0`. If you provide just 1 example, but with the int doctag `10`, there will be 11 rows (for ints 0-10). (In the base case of every sentence getting its own int ID counting up from 0, the `docvecs` rows will exactly match the number of examples given.) – gojomo May 12 '16 at 20:57
@gojomo thank you very much for your reply. Right now my doctags are `ids = [str(x) for x in range(0, len(sentences))]`, where `len(sentences)` is 150,000, so I have a unique tag for each sentence. I'm still getting `len(model.docvecs)` as 10. I'm looking for an answer to this question here: http://stackoverflow.com/questions/37196520/understanding-the-output-of-doc2vec-from-gensim-package – Katya Willard May 13 '16 at 12:11
Because a text example *may* have more than one tag, `tags` is treated as a sequence, so each of your string IDs is being split into its characters, leaving just the 10 tags `'0'`, `'1'`, ..., `'9'`. `ids = [ [str(x), ] for x in range(len(sentences))]` should work. Raw ints are also fine as tags, so `ids = [ [x, ] for x in range(len(sentences))]` would work too. – gojomo May 14 '16 at 23:24
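The row-count behaviour described in these comments can be imitated in plain Python. This is a rough sketch, not gensim's own code: `expected_docvec_rows` is a hypothetical helper that assumes int tags reserve rows 0 through the largest int and every unique string tag adds one more row.

```python
def expected_docvec_rows(tag_lists):
    """Estimate the docvecs row count from the tags given for each example."""
    max_int = -1
    string_tags = set()
    for tags in tag_lists:
        # A bare string is itself a sequence, so it is walked character by character.
        for tag in tags:
            if isinstance(tag, int):
                max_int = max(max_int, tag)
            else:
                string_tags.add(tag)
    return (max_int + 1) + len(string_tags)


# 10 examples that share only two unique string tags.
rows_repeated = expected_docvec_rows([["A"], ["B"]] * 5)

# A single example tagged with the int 10 reserves rows 0 through 10.
rows_single_int = expected_docvec_rows([[10]])

# Bare strings as tags: only the digit characters '0' through '9' survive.
rows_bare = expected_docvec_rows([str(x) for x in range(150000)])

# Each tag wrapped in its own list, as strings or as raw ints.
rows_wrapped = expected_docvec_rows([[str(x)] for x in range(150000)])
rows_ints = expected_docvec_rows([[x] for x in range(150000)])

print(rows_repeated, rows_single_int, rows_bare, rows_wrapped, rows_ints)
# prints: 2 11 10 150000 150000
```

The third case reproduces the `len(model.docvecs)` of 10 reported above, and the last two show why wrapping each tag in a list restores one row per sentence.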
