
I am trying to build a document retrieval model that returns the most relevant documents for a query or search string, ordered by relevance. For this I trained a doc2vec model using the Doc2Vec class in gensim. My dataset is a pandas DataFrame in which each row holds one document as a string. This is the code I have so far:

import multiprocessing
import re

import gensim
import pandas as pd
from gensim.models.doc2vec import LabeledSentence

# TOKENIZER
def tokenizer(input_string):
    return re.findall(r"[\w']+", input_string)

# IMPORT DATA
data = pd.read_csv('mp_1002_prepd.txt')
data.columns = ['merged']
data.loc[:, 'tokens'] = data.merged.apply(tokenizer)
sentences= []
for item_no, line in enumerate(data['tokens'].values.tolist()):
    sentences.append(LabeledSentence(line,[item_no]))

# MODEL PARAMETERS
dm = 1 # 1 for distributed memory(default); 0 for dbow 
cores = multiprocessing.cpu_count()
size = 300
context_window = 50
seed = 42
min_count = 1
alpha = 0.5
max_iter = 200

# BUILD MODEL
model = gensim.models.doc2vec.Doc2Vec(
    documents = sentences,
    dm = dm,
    alpha = alpha, # initial learning rate
    seed = seed,
    min_count = min_count, # ignore words with freq less than min_count
    max_vocab_size = None, # no limit on the vocabulary size
    window = context_window, # the number of words before and after to be used as context
    size = size, # is the dimensionality of the feature vector
    sample = 1e-4, # threshold for downsampling high-frequency words
    negative = 5, # number of negative samples drawn per positive example
    workers = cores, # number of cores
    iter = max_iter, # number of iterations (epochs) over the corpus
)

# QUERY BASED DOC RANKING ??

The part I am struggling with is finding the documents that are most similar/relevant to the query. I used `infer_vector`, but then realised that it treats the query as a document, updates the model and returns the results. I also tried the `most_similar` and `most_similar_cosmul` methods, but they return words along with (I guess) a similarity score. What I want is to enter a search string (a query) and get back the ids of the most relevant documents along with a similarity score (cosine etc.). How do I get this part done?

Clock Slave
  • Does your query exist in the dataset? If so, you can use the sentence_tag to find similar sentences. If not, you could create an inferred vector (after gensim 0.12.4) and query with it. Both use `model.docvecs.most_similar()` – umutto Mar 14 '17 at 09:03
  • @umutto my query is a string, for example "customer segmentation". Both "customer" and "segmentation" exist in the vocabulary. By `sentence_tag` you mean the tag we pass in LabeledSentence, right? If so, then I have used the document id (basically a number 1,2,3...num_docs) as the tag. I used `infer_vector` but that wasn't helpful because it considers the query as the document, updates the model weights and returns similar documents. I don't want to update the model every time I pass a query. Lastly, `model.docvecs.most_similar()` can be used, but it needs a vector to find the most similar docs – Clock Slave Mar 14 '17 at 11:01
  • @umutto So basically the question comes down to how do I get a vector representation of the query without altering the model. – Clock Slave Mar 14 '17 at 11:02
  • The infer method will ignore any words it does not have in its vocab and should not update weights afaik. Passing the inferred vector to the most_similar function should indeed give you back the tags of similar docs. Have you tried that? What happens? Have you saved and loaded the model again? – Luke Barker Mar 15 '17 at 00:30
  • @ClockSlave currently I don't think there is any other way to get the vector representations. If you have a query that exists in your vocabulary then you can use its tag (document id in your case) to calculate similarity or to get its vector. But I don't think infer vector would update the weights. You may see somewhat different results from the same query due to the non-deterministic nature of some of the algorithms used (negative sampling, dbow=1 etc...). But that does not mean the model is altered. – umutto Mar 15 '17 at 01:37
  • @umutto the `infer_vector` method takes parameters like `alpha` and `min_alpha`, so I figured it updates the model as well. However, I am not sure whether they are learning rates or some other parameters – Clock Slave Mar 15 '17 at 18:40

1 Answer


You need to use infer_vector to get a document vector of the new text - which does not alter the underlying model.

Here is how you do it:

tokens = "a new sentence to match".split()

new_vector = model.infer_vector(tokens)
sims = model.docvecs.most_similar([new_vector]) #gives you top 10 document tags and their cosine similarity
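To make the ranking step concrete, here is a minimal sketch of what querying with a vector amounts to: score every stored document vector against the query vector by cosine similarity and keep the best few. This is not gensim's implementation; `cosine`, `rank_documents` and the toy `doc_vecs` dictionary are hypothetical names invented for illustration, and the vectors are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query_vec, doc_vecs, topn=10):
    """Return up to topn (tag, similarity) pairs, most similar first.

    doc_vecs is a hypothetical stand-in for the trained document vectors:
    a dict mapping each document tag to its vector.
    """
    scored = [(tag, cosine(query_vec, vec)) for tag, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up 2-d vectors; in practice the query vector would come from infer_vector.
doc_vecs = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
ranking = rank_documents([1.0, 0.1], doc_vecs, topn=2)
print(ranking)  # document 0 first, then document 2
```

The result has the same shape as `sims` above: a list of (tag, score) pairs, so the tags map straight back to the document ids used when building the training corpus.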

Edit:

Here is an example showing that the underlying model does not change after infer_vector is called.

import numpy as np

words = "king queen man".split()

len_before = len(model.docvecs) # number of docs

# word vectors for king, queen, man
# (copied, because indexing the model returns a view into its weight array,
# which would make the comparison below trivially True)
w_vec0 = model[words[0]].copy()
w_vec1 = model[words[1]].copy()
w_vec2 = model[words[2]].copy()

new_vec = model.infer_vector(words)

len_after = len(model.docvecs)

print(np.array_equal(model[words[0]], w_vec0)) # True
print(np.array_equal(model[words[1]], w_vec1)) # True
print(np.array_equal(model[words[2]], w_vec2)) # True

print(len_before == len_after) # True
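On why inference needs a learning rate at all: the new document vector is itself fitted by a few gradient steps while the trained weights are only read. The toy below sketches that idea under heavy assumptions. It is not gensim's code: `toy_infer_vector`, the three made-up word vectors, the positive-only logistic update and the linear decay from `alpha` to `min_alpha` are all simplifications chosen for illustration.

```python
import copy
import math
import random

def toy_infer_vector(frozen_word_vecs, tokens, alpha=0.1, min_alpha=0.0001,
                     steps=5, seed=None):
    """Fit a vector for unseen text against frozen word vectors (toy version)."""
    rng = random.Random(seed)
    dim = len(next(iter(frozen_word_vecs.values())))
    # Random starting point: the source of run-to-run differences.
    doc_vec = [(rng.random() - 0.5) / dim for _ in range(dim)]
    # Words missing from the vocabulary are skipped.
    known = [t for t in tokens if t in frozen_word_vecs]
    for step in range(steps):
        # Assumed schedule: learning rate decays linearly from alpha to min_alpha.
        lr = alpha - (alpha - min_alpha) * step / max(steps - 1, 1)
        for token in known:
            w = frozen_word_vecs[token]  # read only, never written
            score = sum(d * x for d, x in zip(doc_vec, w))
            grad = 1.0 - 1.0 / (1.0 + math.exp(-score))
            # Only the new document vector is updated.
            doc_vec = [d + lr * grad * x for d, x in zip(doc_vec, w)]
    return doc_vec

word_vecs = {"king": [0.9, 0.1, 0.0], "queen": [0.8, 0.2, 0.1], "man": [0.7, 0.0, 0.3]}
snapshot = copy.deepcopy(word_vecs)

vec_a = toy_infer_vector(word_vecs, ["king", "queen", "man"], seed=1)
vec_b = toy_infer_vector(word_vecs, ["king", "queen", "man"], seed=2)
vec_a_again = toy_infer_vector(word_vecs, ["king", "queen", "man"], seed=1)

print(word_vecs == snapshot)  # the frozen weights are untouched
print(vec_a == vec_b)         # different starting points give different vectors
print(vec_a == vec_a_again)   # the same starting point reproduces the vector
```

In this toy, `alpha` and `min_alpha` only control how far the new vector moves on each step, and results differ between runs purely because of the random starting point, not because anything in the model was modified.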
Erock
  • Are you sure that it doesn't update the model? The `infer_vector` method takes parameters like `alpha` and `min_alpha`. I'm assuming they are learning rates. There's not much given in the documentation, so I am not really sure whether they are learning rates or some other parameters. Also, I came to think that it was updating the model because every time I passed the same sentence to `infer_vector` and then to `most_similar`, I got different results – Clock Slave Mar 15 '17 at 18:39
  • `infer_vector`, like training, has non-deterministic elements. You will get different vectors on each call. There are a number of discussions about this on Gensim's mailing list and in their issue log on GitHub. Here is a good example: https://github.com/RaRe-Technologies/gensim/issues/447. Also, you can test whether the model changes. See my edit. – Erock Mar 15 '17 at 19:53
  • It's clearly stated in the doc2vec paper that at inference time, all the parameters of the model are fixed. So the model definitely doesn't get updated. – Antoine Jan 02 '18 at 12:41
  • @ClockSlave Yes, infer_vector is changing the model. I am reloading the model, after infer_vector & the output is deterministic. Very useful post! – user2849678 Mar 21 '19 at 08:45