Need help in creating an appropriate model to predict semantic similarity between two sentences

Question

I am new to the ML field and am trying my hand at creating a model that predicts the semantic similarity between two sentences. I am using the following approach:

1. Using the word2vec model in the gensim package, vectorise each word in the sentences in question

2. Calculate the average vector of all the words in every sentence/document

import numpy as np
from scipy import spatial

# `model` is assumed to be an already-trained gensim Word2Vec model.
# (In gensim 4.x, `index2word` was renamed to `index_to_key`.)
index2word_set = set(model.wv.index2word)

def avg_feature_vector(sentence, model, num_features, index2word_set):
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        # Only words present in the model's vocabulary contribute
        if word in index2word_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model.wv[word])
    if (n_words > 0):
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec

3. Next, calculate the cosine similarity between these two average vectors

s1_afv = avg_feature_vector('this is a sentence', model=model,
                            num_features=300, index2word_set=index2word_set)
s2_afv = avg_feature_vector('this is also sentence', model=model,
                            num_features=300, index2word_set=index2word_set)
sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
print(sim)

Reference stackoverflow question: How to calculate the sentence similarity using word2vec model of gensim with python

Help needed for the following challenge:

As I want to create a model which would predict semantic similarity between two sentences, I am not quite sure about:

1. Which model would be best suited to this problem

2. More importantly, how to train that model

Should I create a matrix where each row contains two sentences, sen1 and sen2, then vectorise them and calculate the cosine similarity (as per the approach above)?

Then for training data:

X_Train: avg vectors for sen1 and sen2 and their cosine similarity value

y_Train (prediction): a set of binary values (1, i.e. similar, if cosine similarity > 0.7, and 0 otherwise)

I am quite confused about whether my approach is correct, and about how to turn a proper approach into a working codebase.

The internet and materials available online are my only teachers for learning ML, so I am requesting your guidance to help close the gaps in my understanding and to come up with a good working model for my problem.

score 3 · Accepted Answer · answered Oct 09 '18 at 20:28

Your general approach is reasonable. An average of the word-vectors in a sentence often works OK as a rough summary vector of the sentence. (There are many other possible techniques which might do better, but that's a good easy start.)

You can use someone else's pre-trained word-vectors, but if you have a good large training set of text from your domain, word-vectors trained on that text may work better. You should look for a tutorial on how to train your own word-vectors with gensim. For example, there's a demo Jupyter notebook, word2vec.ipynb, included with it in its docs/notebooks directory, which you can also view online at:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb

Your current avg_feature_vector() function has a number of problems. In particular:

- if you pass in the model, it already includes within it the fixed index2word list, and an already-determined number-of-dimensions, so no need to pass those in redundantly
- you're looping over all words in the model, rather than just the ones in your sentence, so not calculating just based on your sentence
- there are better, more pythonic ways to do the various array math operations you're attempting, including in the numpy library a simple mean() function that will spare you the adding/dividing of creating the average
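
The points above could be addressed along the following lines. This is only a minimal sketch, not gensim's own code: it assumes the vectors object supports `word in kv`, `kv[word]` and a `vector_size` attribute, as gensim's KeyedVectors does. `ToyVectors`, its three-dimensional vectors and the `cosine_similarity` helper are hypothetical stand-ins so the example runs without a trained model.

```python
import numpy as np


class ToyVectors:
    """Hypothetical stand-in for a trained set of word-vectors."""

    def __init__(self, vectors):
        self._vectors = {w: np.asarray(v, dtype="float32") for w, v in vectors.items()}
        self.vector_size = len(next(iter(self._vectors.values())))

    def __contains__(self, word):
        return word in self._vectors

    def __getitem__(self, word):
        return self._vectors[word]


def avg_feature_vector(sentence, kv):
    """Average the vectors of the sentence's words that the model knows."""
    known = [kv[word] for word in sentence.split() if word in kv]
    if not known:
        # No known words: fall back to a zero vector of the model's size.
        return np.zeros(kv.vector_size, dtype="float32")
    return np.mean(known, axis=0)


def cosine_similarity(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / norm) if norm else 0.0


kv = ToyVectors({
    "this": [1.0, 0.0, 0.0],
    "is": [0.0, 1.0, 0.0],
    "a": [0.0, 0.0, 1.0],
    "sentence": [1.0, 1.0, 0.0],
})

s1 = avg_feature_vector("this is a sentence", kv)
s2 = avg_feature_vector("this is also sentence", kv)  # "also" is unknown and skipped
print(cosine_similarity(s1, s2))
```

The function now needs only the sentence and the vectors; the vocabulary check and the dimensionality both come from the vectors object itself.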

You may want to fix those problems, as an exercise, but you could also use utility methods on the word-vectors model instead. In particular, look at n_similarity() - it specifically takes two sets-of-words, automatically averages each set, then reports the similarity-value (closer to 1.0 for more-similar, closer to -1.0 for least-similar) between the two sets. See:

https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors.n_similarity

So if you had two sentences (as strings) in sent1 and sent2, and a set of word-vectors (either just-trained by you, or loaded from elsewhere) in kv_model, you could get the sentences' similarity via:

kv_model.n_similarity(sent1.split(), sent2.split())

(You might still get errors if any of the word-tokens aren't known by the model.)
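
One way to avoid those errors might be to drop unknown tokens before calling n_similarity(). The sketch below assumes only that the model supports `word in kv_model`, as gensim's KeyedVectors does. `safe_n_similarity` and `StandInVectors` are hypothetical names; the stand-in class merely mimics the average-then-cosine behaviour described above so the example runs without gensim or a trained model.

```python
import numpy as np


def safe_n_similarity(kv_model, sent1, sent2):
    """Call n_similarity() on only the tokens the model knows.

    Returns None when either sentence has no known tokens.
    """
    tokens1 = [tok for tok in sent1.split() if tok in kv_model]
    tokens2 = [tok for tok in sent2.split() if tok in kv_model]
    if not tokens1 or not tokens2:
        return None
    return kv_model.n_similarity(tokens1, tokens2)


class StandInVectors:
    """Hypothetical stand-in for a word-vectors model, for demonstration only."""

    def __init__(self, vectors):
        self._vectors = {w: np.asarray(v, dtype=float) for w, v in vectors.items()}

    def __contains__(self, word):
        return word in self._vectors

    def n_similarity(self, words1, words2):
        # Average each set of words, then take the cosine of the two averages.
        # This stand-in raises KeyError on an unknown word.
        m1 = np.mean([self._vectors[w] for w in words1], axis=0)
        m2 = np.mean([self._vectors[w] for w in words2], axis=0)
        return float(np.dot(m1, m2) / (np.linalg.norm(m1) * np.linalg.norm(m2)))


kv_model = StandInVectors({"cats": [1.0, 0.2], "dogs": [0.9, 0.3], "purr": [0.1, 1.0]})
print(safe_n_similarity(kv_model, "cats purr loudly", "dogs purr"))  # "loudly" is dropped
print(safe_n_similarity(kv_model, "unknown words", "dogs purr"))     # None
```

Returning None for a sentence with no known words is a design choice for this sketch; depending on what you do next, raising an error or returning 0.0 may suit better.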

Whether you actually create average vectors for different sentences and store them in some list/dict/dataframe/etc, or simply remember the pairwise-similarities somewhere, will depend on what you want to do next.

And, after you have the basics working on this simple measure of text-similarity, you could look into other techniques. For example, another way to compare two texts using word-vectors – but not via the simple average – is called "Word Mover's Distance". (It's quite a bit slower to calculate, though.)
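
In gensim, this distance is available as the wmdistance() method on a set of word-vectors, which relies on an additional solver package. To give a feel for the idea without that dependency, the sketch below computes the cheaper "relaxed" lower bound on Word Mover's Distance, not the full distance: each word simply sends all of its weight to its nearest word in the other text. The toy two-dimensional vectors and the function name `relaxed_wmd` are hypothetical.

```python
from collections import Counter

import numpy as np


def relaxed_wmd(tokens1, tokens2, vectors):
    """Relaxed lower bound on Word Mover's Distance between two token lists.

    `vectors` maps each word to its vector; unknown words are skipped.
    """
    def weights_and_matrix(tokens):
        counts = Counter(tok for tok in tokens if tok in vectors)
        if not counts:
            raise ValueError("no known words in text")
        words = list(counts)
        weights = np.array([counts[w] for w in words], dtype=float)
        matrix = np.array([vectors[w] for w in words], dtype=float)
        return weights / weights.sum(), matrix

    w1, x1 = weights_and_matrix(tokens1)
    w2, x2 = weights_and_matrix(tokens2)

    # Pairwise Euclidean distances between every word of text 1 and text 2.
    cost = np.linalg.norm(x1[:, None, :] - x2[None, :, :], axis=2)

    # Each word sends all its weight to its nearest counterpart; take the
    # tighter (larger) of the two one-directional bounds.
    bound_1_to_2 = float(np.dot(w1, cost.min(axis=1)))
    bound_2_to_1 = float(np.dot(w2, cost.min(axis=0)))
    return max(bound_1_to_2, bound_2_to_1)


toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, 0.9],
    "apple": [-1.0, 0.0],
    "pear": [-0.9, 0.1],
}

print(relaxed_wmd(["king", "apple"], ["queen", "pear"], toy_vectors))  # related texts
print(relaxed_wmd(["king", "queen"], ["apple", "pear"], toy_vectors))  # unrelated texts
```

Unlike a similarity score, this is a distance: smaller values mean the two texts are closer.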

Another technique for collapsing texts into a single vector, for the purposes of comparison, is available in gensim as Doc2Vec – it works a lot like Word2Vec but also creates vectors-per-longer-text, instead of just vectors-per-individual-word.

score 1 · Answer 2 · answered Feb 07 '20 at 11:42

First of all, thank you for asking this question; I am dabbling with the same problem. First and foremost, it is not a simple problem to solve, because it deals with the nuances of language and with how to make a machine understand human language.

A lot has changed in the ML/AI world since this question was asked, so I thought an updated answer might help someone.

The problem with your approach is that it averages the vectors of all the words in a sentence to derive a vector for the sentence. Given the tools of that time it might be okay, but you could have gone with something more sophisticated, like Doc2Vec from gensim.

Today, I feel there are far more sophisticated and effective word embeddings, or rather sentence embeddings, that you can use.

A list of a few of them has been very cleverly curated at this GitHub repo from HuggingFace.

There is also an amazing discussion which I would love to point out to people reading this:

https://github.com/huggingface/transformers/issues/876

People often try to use a combination of ElasticSearch BM25 + Embeddings, but practical accuracy is still lower.

I am still in search of a semantic search technique which can help me do a semantic search over my domain data.

The closest thing is transfer learning, where you take a pre-trained model and fine-tune it on your domain data.

You can have a look at the example: https://github.com/UKPLab/sentence-transformers/blob/master/examples/application_semantic_search.py

But even this approach cannot take ontology into account, nor the various scenarios where word overlap should come into the picture.
