0

Suppose I have 10 million items, each identified by a 100-dimensional vector of real numbers (they are actually word2vec embeddings). For each item I want to get the (approximate) top 200 most similar items, using cosine similarity. My current standard cosine-similarity implementation, written as a UDF in Hadoop (Hive), takes about 1 s to compare one item against the other 10 million, which makes running it over the whole matrix infeasible. My next move is to run it on Spark with more parallelization, but even that won't solve the problem completely.

I know there are methods that reduce the computation for a sparse matrix, but my matrix is NOT sparse.

How can I efficiently get the most similar items for each item? Is there an approximation of cosine similarity that will be more efficient to calculate?
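
For reference, the per-pair computation is essentially the standard formula, sketched below (an illustration, not my exact Hive UDF). Each comparison is O(d), so the all-pairs cost is O(n²·d), roughly 10⁷ · 10⁷ · 100 operations here.

```java
// Minimal sketch of one cosine-similarity comparison (O(d) per pair).
public final class Cosine {
    public static double similarity(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```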

cybergeek654

1 Answer

0

You can compress the vectors so that the score computation becomes cheaper, by comparing the compressed codes with a different distance, such as Hamming distance.
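
For illustration, here is a minimal sketch of one such scheme, sign random projections (random-hyperplane LSH). The class and method names are my own, not from any library: each vector is hashed to a bit signature, and the Hamming distance between signatures estimates the angle, hence the cosine.

```java
import java.util.BitSet;
import java.util.Random;

// Sign random projections: bit i of a signature is the sign of the dot
// product with random hyperplane i. For vectors at angle theta,
// P[bits differ] = theta / pi, so cosine ~ cos(pi * hammingFraction).
public final class RandomHyperplanes {
    private final double[][] hyperplanes; // bits x dimensions

    public RandomHyperplanes(int bits, int dims, long seed) {
        Random rng = new Random(seed);
        hyperplanes = new double[bits][dims];
        for (double[] h : hyperplanes)
            for (int j = 0; j < dims; j++)
                h[j] = rng.nextGaussian();
    }

    // Compress a real-valued vector to a bit signature.
    public BitSet signature(double[] v) {
        BitSet sig = new BitSet(hyperplanes.length);
        for (int i = 0; i < hyperplanes.length; i++) {
            double dot = 0;
            for (int j = 0; j < v.length; j++)
                dot += hyperplanes[i][j] * v[j];
            if (dot >= 0) sig.set(i);
        }
        return sig;
    }

    // Estimate cosine similarity from the Hamming distance of two signatures.
    public double estimatedCosine(BitSet a, BitSet b) {
        BitSet diff = (BitSet) a.clone();
        diff.xor(b);
        double theta = Math.PI * diff.cardinality() / hyperplanes.length;
        return Math.cos(theta);
    }
}
```

With, say, 256-bit signatures, one comparison becomes a few XOR/popcount word operations instead of 100 floating-point multiply-adds, and signatures can additionally be bucketed so that most pairs are never compared at all.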

The keyword to search for is vector quantization; there are many algorithms for this kind of vector compression.

Here is an example of making the compressed representation comparable to cosine similarity:

https://github.com/tdebatty/java-LSH/blob/master/src/main/java/info/debatty/java/lsh/SuperBit.java#L208
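
If I read it correctly, usage looks roughly like the following (method names taken from the java-LSH README; verify them against the version you actually depend on):

```java
import info.debatty.java.lsh.SuperBit;
import java.util.Random;

public class SuperBitDemo {
    public static void main(String[] args) {
        int d = 100;                   // dimensionality of the embeddings
        SuperBit sb = new SuperBit(d); // one shared instance for all vectors

        // Two random vectors standing in for word2vec embeddings.
        Random rng = new Random(42);
        double[] v1 = new double[d];
        double[] v2 = new double[d];
        for (int i = 0; i < d; i++) {
            v1[i] = rng.nextGaussian();
            v2[i] = rng.nextGaussian();
        }

        boolean[] sig1 = sb.signature(v1);
        boolean[] sig2 = sb.signature(v2);

        // Estimated (from signatures) vs. exact cosine similarity.
        System.out.println(sb.similarity(sig1, sig2));
        System.out.println(SuperBit.cosineSimilarity(v1, v2));
    }
}
```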

Steven Chou