Using Custom Word2Vec to find semantic similarity between technical questions?

Question

We can get the similarity of two sentences like "The boy is playing football" and "A kid is playing football" using Google news vectors by applying "SIF Embeddings".

I would like to get the similarity for two sentences which are technical like "what is an abstract class?" and "what is a class?".

I tried the Google News vectors for this, but the similarity scores were poor.

What should the training data look like if I train a custom Word2Vec model for this?

You should give a minimal working example, otherwise this type of generic question is more appropriate in a theoretical rather than applied context: https://meta.stackexchange.com/questions/130524/which-stack-exchange-website-for-machine-learning-and-computational-algorithms — jonnybazookatone, Oct 31 '17 at 07:52
I have edited the question a bit. can you please look into it. — Poorna Prudhvi, Oct 31 '17 at 07:59

de1 · Accepted Answer · 2017-10-31T08:18:16.900


Word2Vec is an algorithm that generates vectors for words such that similar words tend to have similar vectors. It does not produce sentence vectors on its own.

You have more or less the following options:

Create a sentence vector
Compare similarity of word vectors within two sentences

Create a sentence vector

You could build sentence, paragraph or document vectors. There are different approaches to that. You could, for example, combine the word2vec vectors of the individual words. If you just want a ready-made solution, you could go for gensim's doc2vec: https://radimrehurek.com/gensim/models/doc2vec.html

Other than that, there are methods such as concatenating all the word vectors (of a fixed length).
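One simple way to combine word vectors into a sentence vector is to average them and compare sentences by cosine similarity. The sketch below is only an illustration: the 3-dimensional vectors in `TOY_VECTORS` are made up, and the helper names are hypothetical. With a real model you would look the vectors up in your trained word2vec vocabulary instead.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# In practice these would come from a trained word2vec model.
TOY_VECTORS = {
    "abstract": [0.9, 0.1, 0.3],
    "class": [0.8, 0.2, 0.1],
    "interface": [0.7, 0.3, 0.2],
    "football": [0.0, 0.9, 0.8],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of all in-vocabulary tokens; None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(s1, s2, vectors):
    """Cosine similarity of the averaged word vectors of two sentences."""
    v1 = sentence_vector(s1.lower().split(), vectors)
    v2 = sentence_vector(s2.lower().split(), vectors)
    if v1 is None or v2 is None:
        return None  # at least one sentence has no in-vocabulary words
    return cosine(v1, v2)


print(sentence_similarity("abstract class", "class", TOY_VECTORS))
print(sentence_similarity("abstract class", "football", TOY_VECTORS))
```

Note that words missing from the vocabulary are silently skipped here, so a sentence made up mostly of unknown technical terms ends up being represented by very few words.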

Similar questions: How to calculate the sentence similarity using word2vec model of gensim with python

Compare similarity of word vectors within two sentences

One such approach is Word Mover's Distance; see: Pairwise Earth Mover Distance across all documents (word2vec representations)

This seems like a good but computationally expensive approach.
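To give a feel for the idea and the cost, here is a rough sketch of a relaxed variant, not the exact Word Mover's Distance. The exact distance solves a transportation problem between the two sentences; the relaxation below simply moves every word to its nearest word in the other sentence and takes the larger of the two directions, which gives a lower bound on the exact value. The vectors and function names are made up for illustration.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
TOY_VECTORS = {
    "abstract": [0.9, 0.1, 0.3],
    "class": [0.8, 0.2, 0.1],
    "interface": [0.7, 0.3, 0.2],
    "football": [0.0, 0.9, 0.8],
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(src, dst, vectors):
    """Mean distance from each source token to its nearest destination token."""
    total = 0.0
    for s in src:
        total += min(euclidean(vectors[s], vectors[d]) for d in dst)
    return total / len(src)


def relaxed_wmd(tokens1, tokens2, vectors):
    """Relaxed Word Mover's Distance: a lower bound on the exact distance."""
    a = [t for t in tokens1 if t in vectors]
    b = [t for t in tokens2 if t in vectors]
    if not a or not b:
        return None  # nothing to compare if a sentence has no known words
    return max(one_sided_cost(a, b, vectors), one_sided_cost(b, a, vectors))


print(relaxed_wmd(["abstract", "class"], ["class"], TOY_VECTORS))
print(relaxed_wmd(["abstract", "class"], ["football"], TOY_VECTORS))
```

Even this relaxed version compares every word in one sentence with every word in the other, so comparing all pairs of documents in a corpus gets expensive quickly.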

Update: You have since updated your question to mention that you are using "SIF Embeddings" (instead of plain word2vec): https://openreview.net/forum?id=SyK00v5xx
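As I understand the linked paper, SIF builds a sentence vector as a weighted average of word vectors, where a word with unigram probability p(w) gets weight a / (a + p(w)), and then removes the projection onto the first principal component computed over all sentence vectors. The sketch below covers only the weighting step and leaves out the component removal. The vectors, the probabilities, the default `a=1e-3` and the choice to treat words without a known probability as p(w) = 0 are all assumptions made for illustration.

```python
# Toy 2-dimensional vectors and unigram probabilities, invented for illustration.
TOY_VECTORS = {
    "the": [1.0, 0.0],
    "class": [0.0, 1.0],
}
TOY_WORD_PROB = {
    "the": 0.05,      # very frequent word, so it gets a small weight
    "class": 0.0001,  # rarer word, so it gets a weight close to 1
}


def sif_weight(prob, a=1e-3):
    """SIF weight a / (a + p(w)); frequent words are down-weighted."""
    return a / (a + prob)


def sif_sentence_vector(tokens, vectors, word_prob, a=1e-3):
    """Weighted average of word vectors (SIF weighting step only)."""
    known = [t for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(vectors[known[0]])
    total = [0.0] * dim
    for t in known:
        # Assumption: words without a known probability are treated as p(w) = 0.
        w = sif_weight(word_prob.get(t, 0.0), a)
        for i in range(dim):
            total[i] += w * vectors[t][i]
    return [x / len(known) for x in total]


print(sif_sentence_vector(["the", "class"], TOY_VECTORS, TOY_WORD_PROB))
```

For technical text, the word probabilities should be estimated from a corpus that resembles your questions, otherwise domain terms may be weighted as if they were rare or missing.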

edited Oct 31 '17 at 08:18

answered Oct 31 '17 at 08:02


My problem is how to deal with technical sentences, not how to use word2vec to get sentence vectors – Poorna Prudhvi Oct 31 '17 at 08:04
Have you tried it? I don't see how technical sentences are much different to other sentences. You might just need to train it on the right corpus. – de1 Oct 31 '17 at 08:06
I have tried it on technical questions and it's not working well. I see a lot of OOV (out-of-vocabulary) tokens – Poorna Prudhvi Oct 31 '17 at 08:06
In that case you should really clarify your question and provide exact examples of what you tried, how, and what you trained it on, with some code examples. Your title alone seems to include two questions: whether word2vec can be used for similarity between sentences, and then there is the word 'technical'. I don't see why it shouldn't work for technical questions. – de1 Oct 31 '17 at 08:10
Because the data Google trained on is different. I need to know the characteristics of the data needed to train a custom word2vec – Poorna Prudhvi Oct 31 '17 at 08:23
