Questions tagged [doc2vec]

Doc2Vec is an unsupervised algorithm that converts documents into vectors ("dense embeddings"). It is based on the "Paragraph Vector" paper and is implemented in the Gensim Python library and elsewhere. The algorithm can work in either a "Distributed Bag of Words" mode (PV-DBOW, which works somewhat analogously to the skip-gram mode in Word2Vec) or a "Distributed Memory" mode (PV-DM, which is more analogous to the CBOW mode in Word2Vec).
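
As a rough illustration of how the two modes differ, the sketch below builds the (input, target) pairs each mode would train on for one tiny document. It is not Gensim's implementation and trains no vectors. The function names `pv_dbow_pairs` and `pv_dm_pairs`, the tag `DOC_0`, and the symmetric context window (borrowed from CBOW) are all assumptions made for this example.

```python
from typing import List, Tuple


def pv_dbow_pairs(tag: str, words: List[str]) -> List[Tuple[str, str]]:
    """PV-DBOW sketch: the document tag alone is the input for every word."""
    return [(tag, word) for word in words]


def pv_dm_pairs(
    tag: str, words: List[str], window: int = 2
) -> List[Tuple[Tuple[str, ...], str]]:
    """PV-DM sketch: the document tag plus nearby words predict the center word.

    Assumes a symmetric window of `window` words on each side, as in CBOW.
    """
    pairs = []
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        pairs.append(((tag, *left, *right), target))
    return pairs


# Hypothetical toy document and tag, for illustration only.
doc = ["the", "cat", "sat", "down"]
dbow = pv_dbow_pairs("DOC_0", doc)
dm = pv_dm_pairs("DOC_0", doc, window=1)

print(dbow[:2])
print(dm[:2])
```

In PV-DBOW the input never includes neighboring words, which is why it resembles skip-gram with the document tag standing in for the center word. In PV-DM the tag is combined with the context words, much as CBOW combines context words to predict the center word.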

556 questions
125
votes
10 answers

ImportError: cannot import name 'joblib' from 'sklearn.externals'

I am trying to load my saved model from S3 using joblib: import pandas as pd import numpy as np import json import subprocess import sqlalchemy from sklearn.externals import joblib ENV = 'dev' model_d2v = load_d2v('model_d2v_version_002', ENV) def…
Praneeth Sai
  • 1,421
  • 2
  • 7
  • 11
55
votes
1 answer

gensim Doc2Vec vs tensorflow Doc2Vec

I'm trying to compare my implementation of Doc2Vec (via tf) and gensim's implementation. It seems, at least visually, that the gensim ones are performing better. I ran the following code to train the gensim model and the one below that for tensorflow…
sachinruk
  • 9,571
  • 12
  • 55
  • 86
46
votes
4 answers

How to use Gensim doc2vec with pre-trained word vectors?

I recently came across the doc2vec addition to Gensim. How can I use pre-trained word vectors (e.g. those found on the original word2vec website) with doc2vec? Or is doc2vec getting the word vectors from the same sentences it uses for paragraph-vector…
Stergios
  • 3,126
  • 6
  • 33
  • 55
44
votes
1 answer

Doc2Vec Get most similar documents

I am trying to build a document retrieval model that returns documents ordered by their relevance to a query or a search string. For this I trained a doc2vec model using the Doc2Vec model in gensim. My dataset is in the form of a…
Clock Slave
  • 7,627
  • 15
  • 68
  • 109
17
votes
2 answers

Is there pre-trained doc2vec model?

Is there a pre-trained doc2vec model with a large data set, like Wikipedia or similar?
Idriss Brahimi
  • 171
  • 1
  • 1
  • 5
17
votes
3 answers

How to use TaggedDocument in gensim?

I have two directories from which I want to read their text files and label them, but I don't know how to do this via TaggedDocument. I thought it would work as TaggedDocument([Strings],[Labels]) but this doesn't work apparently. This is my code:…
Farhood
  • 391
  • 2
  • 4
  • 16
15
votes
2 answers

How does gensim calculate doc2vec paragraph vectors

I am going through this paper http://cs.stanford.edu/~quocle/paragraph_vector.pdf and it states that "The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. In the experiments, we use …
jxn
  • 7,685
  • 28
  • 90
  • 172
13
votes
1 answer

How to break conversation data into pairs of (Context , Response)

I'm using the Gensim Doc2Vec model, trying to cluster portions of customer support conversations. My goal is to give the support team auto-response suggestions. Figure 1 shows a sample conversation where the user question is answered in the next…
Shlomi Schwartz
  • 8,693
  • 29
  • 109
  • 186
12
votes
2 answers

Doc2Vec.infer_vector keeps giving a different result every time on a particular trained model

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb I modified the code in line 10 to determine the best matching document for the given…
Rohan
  • 665
  • 9
  • 17
10
votes
1 answer

How to use the infer_vector in gensim.doc2vec?

def cosine(vector1,vector2): cosV12 = np.dot(vector1, vector2) / (linalg.norm(vector1) * linalg.norm(vector2)) return cosV12 model=gensim.models.doc2vec.Doc2Vec.load('Model_D2V_Game') string='民生 为了 父亲 我 要 坚强 地 ...' list=string.split('…
Jeffery
  • 151
  • 1
  • 1
  • 7
9
votes
2 answers

Why Doc2vec gives 2 different vectors for the same texts

I am using Doc2vec to get vectors from words. Please see my code below: from gensim.models.doc2vec import TaggedDocument f = open('test.txt','r') trainings = [TaggedDocument(words = data.strip().split(","),tags = [i]) for i,data in…
Thanh Bui
  • 103
  • 5
9
votes
1 answer

Improving Gensim Doc2vec results

I tried to apply doc2vec to 600,000 rows of sentences. Code below: from gensim import models model = models.Doc2Vec(alpha=.025, min_alpha=.025, min_count=1, workers = 5) model.build_vocab(res) token_count = sum([len(sentence) for sentence in…
Hackerds
  • 1,195
  • 2
  • 16
  • 34
9
votes
1 answer

what is the minimum dataset size needed for good performance with doc2vec?

How does doc2vec perform when trained on different sized datasets? There is no mention of dataset size in the original corpus, so I am wondering what is the minimum size required to get good performance out of doc2vec.
pete the dude
  • 139
  • 3
  • 7
9
votes
1 answer

Doc2Vec Worse Than Mean or Sum of Word2Vec Vectors

I'm training a Word2Vec model like: model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1) and Doc2Vec model like: doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4,…
ScientiaEtVeritas
  • 5,158
  • 4
  • 41
  • 59
9
votes
3 answers

Document similarity: Vector embedding versus Tf-Idf performance?

I have a collection of documents, where each document is rapidly growing with time. The task is to find similar documents at any fixed time. I have two potential approaches: A vector embedding (word2vec, GloVe or fasttext), averaging over word…
Alec Matusis
  • 781
  • 1
  • 7
  • 16