I have 10 million items, each identified by a 100-dimensional vector of real numbers (they are actually word2vec embeddings). For each item I want to get (approximately) the top 200 most similar items, using cosine similarity. My current standard cosine similarity implementation, a UDF in Hive (Hadoop), takes about 1 second to compare one item against all 10 million others. That makes it infeasible to run over the whole matrix. My next move is to run it on Spark with more parallelization, but even that won't solve the problem completely.
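For concreteness, here is a minimal NumPy sketch of the brute-force computation I'm describing (the function name, sizes, and data are illustrative, not my actual Hive code). Normalizing the vectors first reduces cosine similarity to a dot product:

```python
import numpy as np

def top_k_cosine(query, items, k):
    """Return indices of the k items most cosine-similar to `query`."""
    # Normalize rows so cosine similarity becomes a plain dot product.
    items_norm = items / np.linalg.norm(items, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    sims = items_norm @ q_norm                 # one similarity score per item
    # argpartition finds the k largest in O(n), then we sort just those k.
    top = np.argpartition(-sims, k)[:k]
    return top[np.argsort(-sims[top])]

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 100))  # toy stand-in for my 10M x 100 matrix
query = items[0]
result = top_k_cosine(query, items, 5)
print(result)  # index 0 (the query itself) should rank first
```

Scaling this to all pairs is the problem: 10M queries times 10M items is 10^14 similarity computations, which is why brute force (in Hive or Spark) doesn't get me there.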
I know there are methods that reduce the number of calculations for a sparse matrix, but my matrix is NOT sparse.
How can I efficiently get the most similar items for each item? Is there an approximation of cosine similarity that is cheaper to compute?