I am new to the ML field and am trying to build a model that predicts the semantic similarity between two sentences. I am using the following approach:
1. Use the word2vec model in the gensim package to vectorise each word in the sentences in question
2. Calculate the average vector of all words in each sentence/document
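import numpy as np
from scipy import spatial

# `model` is assumed to be an already trained gensim Word2Vec model.
# `index2word` is the gensim 3.x name; gensim 4 renamed it to `index_to_key`.
index2word_set = set(model.wv.index2word)

def avg_feature_vector(sentence, model, num_features, index2word_set):
    # Average the vectors of all in-vocabulary words in the sentence
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2word_set:
            n_words += 1
            # Look the word up through model.wv, matching the vocabulary set above
            feature_vec = np.add(feature_vec, model.wv[word])
    if (n_words > 0):
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec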
3. Next, calculate the cosine similarity between the two average vectors
s1_afv = avg_feature_vector('this is a sentence', model=model,
                            num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model,
                            num_features=300, index2word_set=index2word_set)
# scipy returns the cosine distance, so similarity = 1 - distance
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)
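To check my understanding of steps 2 and 3 without needing a trained model, here is a minimal dependency-free sketch of the same averaging and cosine computation. The 3-dimensional `toy_vectors` dictionary and the function names `average_vector` and `cosine_similarity` are made up for illustration and only stand in for the real word2vec vectors. Returning 0.0 when a sentence has no known words is my own choice here, not something taken from gensim or scipy.

```python
import math

# Hypothetical 3-dimensional toy embeddings standing in for a trained word2vec model.
toy_vectors = {
    "this": [1.0, 0.0, 0.0],
    "is": [0.0, 1.0, 0.0],
    "a": [0.0, 0.0, 1.0],
    "also": [0.0, 1.0, 1.0],
    "sentence": [1.0, 1.0, 0.0],
}

def average_vector(sentence, vectors, num_features):
    """Average the vectors of all in-vocabulary words; all zeros if none are known."""
    total = [0.0] * num_features
    n_words = 0
    for word in sentence.split():
        if word in vectors:
            n_words += 1
            total = [t + v for t, v in zip(total, vectors[word])]
    if n_words == 0:
        return total
    return [t / n_words for t in total]

def cosine_similarity(u, v):
    """dot(u, v) / (|u| * |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

s1 = average_vector("this is a sentence", toy_vectors, 3)
s2 = average_vector("this is also sentence", toy_vectors, 3)
print(cosine_similarity(s1, s2))
```

Because averaging ignores word order, two sentences containing the same words in a different order would get a similarity of 1 under this scheme.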
Reference Stack Overflow question: How to calculate the sentence similarity using word2vec model of gensim with python
I need help with the following:
Since I want to build a model that predicts the semantic similarity between two sentences, I am not sure about:
1. Which model would be best suited to this problem
2. More importantly, how to train that model
Should I create a matrix in which each row contains two sentences, sen1 and sen2, then vectorise them and calculate their cosine similarity (as in the approach described above)?
The training data would then be:
X_Train: the average vectors for sen1 and sen2, plus their cosine similarity value
y_Train (target): a binary label (1, meaning similar, if the cosine similarity > 0.7, and 0 otherwise)
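To make the proposed training setup concrete, here is a rough sketch of how I imagine building these two structures. The helper name `build_training_rows`, the toy 3-dimensional average vectors, and the lower-case names `X_train` and `y_train` are placeholders of mine; in practice the vectors would come from the averaging step described earlier.

```python
import math

def cosine_similarity(u, v):
    """dot(u, v) / (|u| * |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def build_training_rows(pairs, threshold=0.7):
    """Hypothetical helper: pairs is a list of (avg_vec_sen1, avg_vec_sen2).

    Each feature row is avg_vec_sen1 + avg_vec_sen2 + [cosine similarity];
    each label is 1 if that similarity exceeds the threshold, else 0.
    """
    X_train, y_train = [], []
    for v1, v2 in pairs:
        sim = cosine_similarity(v1, v2)
        X_train.append(list(v1) + list(v2) + [sim])
        y_train.append(1 if sim > threshold else 0)
    return X_train, y_train

# Toy average vectors (made up for illustration).
pairs = [
    ([0.5, 0.5, 0.25], [0.5, 0.75, 0.25]),  # nearly parallel vectors
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),     # orthogonal vectors
]
X_train, y_train = build_training_rows(pairs)
print(len(X_train[0]), y_train)
```

Written out like this, I notice that each label in `y_train` is computed directly from the last column of `X_train`, which is part of why I am unsure whether this setup gives a model anything to learn.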
I am unsure whether my approach is correct and how to turn a proper approach into working code.
Online materials are my only teachers for learning ML, so I would appreciate your guidance in closing the gaps in my understanding and in arriving at a good working model for my problem.