I'm new to NLP, but I'm trying to match a list of sentences to another list of sentences in Python based on their semantic similarity. For example:
list1 = ['what they ate for lunch', 'height in inches', 'subjectid']
list2 = ['food eaten two days ago', 'height in centimeters', 'id']
Based on previous posts and prior knowledge, it seemed the best approach was to create a document vector for each sentence and compute the cosine similarity score between the lists. Other posts I've found regarding Doc2Vec, as well as the tutorial, seem focused on prediction. This post does a good job doing the calculation by hand, but I thought Doc2Vec could already do that. The code I'm using is
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def build_model(train_docs, test_docs, comp_docs):
    '''
    Parameters
    -----------
    train_docs: list of lists - both sentence lists combined, used for training
    test_docs: list of lists - one of the sentence lists
    comp_docs: list of lists - combined sentence lists, used to map an index back to its sentence
    '''
    # Train model
    model = Doc2Vec(dm=0, dbow_words=1, window=2, alpha=0.2)  # , min_alpha=0.025
    model.build_vocab(train_docs)
    for epoch in range(10):
        # One pass per loop iteration
        model.train(train_docs, total_examples=model.corpus_count, epochs=1)
        # model.alpha -= 0.002
        # model.min_alpha = model.alpha

    scores = []
    for doc in test_docs:
        dd = {}
        # Calculate the cosine similarity and return top 40 matches
        score = model.docvecs.most_similar([model.infer_vector(doc)], topn=40)
        key = " ".join(doc)
        for i in range(len(score)):
            # Get index and score
            x, y = score[i]
            # Match sentence from other list
            nkey = ' '.join(comp_docs[x])
            dd[nkey] = y
        scores.append({key: dd})
    return scores
This works for calculating the similarity scores, but the issue is that I have to train the model on the sentences from both lists (or at least one of them) before matching. Is there a way to use Doc2Vec to just get the vectors and then compute the cosine similarity myself? To be clear, I'm trying to find the most similar sentences between the lists. I'd expect output like
scores = []
for s1 in list1:
    for s2 in list2:
        scores.append((s1, s2, similarity(s1, s2)))
print(scores)
[('what they ate for lunch', 'food eaten two days ago', 0.23567),
('what they ate for lunch', 'height in centimeters', 0.120),
('what they ate for lunch', 'id', 0.01023),
('height in inches', 'food eaten two days ago', 0.123),
('height in inches', 'height in centimeters', 0.8456),
('height in inches', 'id', 0.145),
('subjectid', 'food eaten two days ago', 0.156),
('subjectid', 'height in centimeters', 0.1345),
('subjectid', 'id', 0.9567)]
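Something along these lines is what I'm imagining (just a rough sketch, not working code: I'm assuming infer_vector can be applied to sentences the model wasn't trained on, I'm using scipy for the cosine part, and the training corpus here is only a stand-in):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy.spatial.distance import cosine

list1 = ['what they ate for lunch', 'height in inches', 'subjectid']
list2 = ['food eaten two days ago', 'height in centimeters', 'id']

# Train once on whatever corpus is available -- here I just reuse the two lists
# so there is something to train on, but ideally this would be a separate corpus
train_corpus = [TaggedDocument(words=s.split(), tags=[i])
                for i, s in enumerate(list1 + list2)]
model = Doc2Vec(train_corpus, vector_size=50, min_count=1, epochs=40)

scores = []
for s1 in list1:
    v1 = model.infer_vector(s1.split())
    for s2 in list2:
        v2 = model.infer_vector(s2.split())
        # scipy's cosine() is a distance, so similarity = 1 - distance
        scores.append((s1, s2, 1 - cosine(v1, v2)))
print(scores)

Is that a sensible way to use infer_vector, or am I missing the intended workflow?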