I am trying to find all pairs of similar sentences in a set of sentences, and I am wondering how I could optimize this.
I am using a Word2Vec model. To decide whether two sentences are similar, I sum the word vectors of each sentence, compute the cosine similarity between the two resulting vectors, and if the result is higher than 0.9 I add the pair to the list of similar sentences.
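To make the procedure concrete, here is a minimal sketch of the per-pair check. It assumes NumPy and uses a small hand-written dictionary, `word_vectors`, as a stand-in for the trained Word2Vec model; the helper names `sentence_vector` and `cosine` are made up for illustration.

```python
import numpy as np

# Hypothetical toy embeddings standing in for a trained Word2Vec model.
word_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.9, 0.1, 0.0]),
    "sat": np.array([0.0, 1.0, 0.0]),
    "ran": np.array([0.0, 0.9, 0.1]),
    "car": np.array([0.0, 0.0, 1.0]),
}


def sentence_vector(tokens, vectors):
    """Sum the vectors of all tokens that have an embedding."""
    dim = next(iter(vectors.values())).shape[0]
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known words: fall back to the zero vector.
        return np.zeros(dim)
    return np.sum(known, axis=0)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


a = sentence_vector(["cat", "sat"], word_vectors)
b = sentence_vector(["dog", "ran"], word_vectors)
is_similar = cosine(a, b) > 0.9  # same 0.9 threshold as described above
```

Since cosine similarity ignores vector length, summing and averaging the word vectors give the same result here.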
The problem is that right now I compare every sentence with every other one, which means O(n^2) complexity; that is not so good if I have a large set of sentences.
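For reference, the all-pairs comparison can be sketched as follows, assuming the sentence vectors have already been computed and stacked into a NumPy array. The function name `similar_pairs` is hypothetical. Normalizing the rows once turns every cosine into a plain dot product, so a single matrix product yields all similarities; this is faster in practice but still performs O(n^2) comparisons and needs O(n^2) memory.

```python
import numpy as np


def similar_pairs(sentence_vectors, threshold=0.9):
    """Return index pairs (i, j), i < j, whose cosine similarity exceeds threshold."""
    X = np.asarray(sentence_vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    X = X / norms
    sims = X @ X.T  # n x n matrix of cosine similarities
    i, j = np.triu_indices(len(X), k=1)  # upper triangle: each pair once, no self-pairs
    mask = sims[i, j] > threshold
    return list(zip(i[mask].tolist(), j[mask].tolist()))


# Toy sentence vectors, invented for illustration.
vectors = [
    [1.0, 1.0, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.0, 1.0],
]
pairs = similar_pairs(vectors)
```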
So my question: is there any way to pre-process the set of sentences in order to reduce the number of comparisons (and get O(n log n) complexity)?
I cannot get my head around this, as I am pretty new to the Word2Vec representation and I do not really see a way to sort the sentences that would help.