0

I am trying to find all the similar sentences amongst a set of sentences, and I am wondering how I could optimize it.

I am using a Word2Vec model, so in order to find similar sentences I sum all the vectors in the 1st sentence and 2nd sentence, then do the cosine of both, and if the result is higher than 0.9 I add it to the list of similar sentences.

The problem is right now I am comparing all the sentences with the others, meaning a O(n^2) complexity, which is not so good if I have a large set of sentences.

So my question : is there any way to pre-process the set of sentences in order to reduce the number of comparisons (and get a O(nlogn) complexity)?

I could not get my head around this as I am pretty new with this Word2Vec representation and I do not really see a way to sort the sentences in a way that would help.

realUser404
  • 2,111
  • 3
  • 20
  • 38
  • As an improvement to the method you are using to find the similar sentences can be better if you implement https://stackoverflow.com/a/44383363/5443381 – Poorna Prudhvi Jul 24 '17 at 06:40

1 Answers1

0

Unfortunately, due to issues in high-dimensional spaces (related to the "curse of dimensionality"), there's no simple/easy way to do a lot better than doing the bulk, all-pairwise comparisons.

There are some techniques for pre-building approximate indexes; see for example the ANNOY library or Facebook's FAISS (non-commercial license only at this time). With these, extra time up-front and index-space can speed later nearest-neighbor queries – but at a cost of full accuracy.

Otherwise, you're stuck with smart batching & caching to help you avoid unnecessary re-calculations, or throwing a lot of machines at big datasets to reduce perceived wall-time.

(Separately: you may not want to use a absolute threshold, like 0.9, as opposed to reviewing the top-N for any vector. The ranges and human-judgement-correlated interpretation of absolute-similarities may not be as stable across model meta-parameters, or regions of space, as relative-rankings.)

gojomo
  • 52,260
  • 14
  • 86
  • 115