I am in serious trouble. I want to compute pairwise similarities between ten million records, but the job stops because Spark runs out of memory. The ten million documents are hashed via TF hashing into 20,000-dimensional feature vectors. First I tried Spark's approximate similarity join, but the computation did not finish. Next I tried scikit-learn's KNN, but collecting all of the data onto the driver overflowed its memory. Is there any other way to do this?
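For context, here is a minimal sketch of the setup described above, assuming Spark's `HashingTF` in PySpark; the toy data and column names are illustrative, not from the original question:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF

spark = SparkSession.builder.appName("tf-hashing-sketch").getOrCreate()

# Toy stand-in for the ten million documents; a `words` column of tokens is assumed.
docs = spark.createDataFrame(
    [(0, ["spark", "memory", "error"]),
     (1, ["nearest", "neighbor", "search"])],
    ["id", "words"],
)

# Hash each document into a 20,000-dimensional sparse term-frequency vector,
# matching the feature size described in the question.
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=20000)
features = hashing_tf.transform(docs)
features.show(truncate=False)
```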
- Can you clarify your abbreviations? Is TF TensorFlow? Is KNN k-nearest neighbor? – Michael West Dec 22 '18 at 14:40
- Take a look at [Efficient string matching in Apache Spark](https://stackoverflow.com/q/43938672/10465355) – 10465355 Dec 22 '18 at 16:46
- kNN is not part of the native Spark MLlib and ml packages. LinkedIn has published a kNN implementation for Spark: https://github.com/linkedin/scanns – andrew Dec 22 '18 at 20:21
- @MichaelWest TF means term frequency, sorry. KNN is k-nearest neighbor. – tatsuya.takahashi Dec 29 '18 at 10:02
- @user10465355 Thank you for your answer. – tatsuya.takahashi Jan 01 '19 at 07:20
- @andrew Thank you for your kindness. I'm trying it. – tatsuya.takahashi Jan 01 '19 at 07:21
1 Answer
Nearest neighbor search does not seem to be part of Spark's MLlib. The options I can think of are to find a distributed Spark implementation or a TensorFlow implementation.
Are you on Databricks? Recent versions support distributed TensorFlow. I have run larger volumes than yours on a single-node Databricks TensorFlow cluster.
A quick search turned up these:
- tensorflow nearest neighbor
- spark nearest neighbor

Note that I have not tried these myself.
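As a rough illustration of the "tensorflow implementation" suggestion, a brute-force, batched nearest-neighbor lookup in TensorFlow might look like the sketch below; the cosine metric, toy sizes, and function name are assumptions, not part of this answer, and an exact scan would still need batching or an approximate index at the 10M-document scale:

```python
import tensorflow as tf

num_docs, dim, k = 1000, 128, 5          # toy sizes; the real data is 10M x 20,000
docs = tf.math.l2_normalize(tf.random.uniform([num_docs, dim]), axis=1)

def top_k_neighbors(queries, corpus, k):
    """Return cosine similarities and indices of the k nearest corpus rows."""
    sims = tf.matmul(tf.math.l2_normalize(queries, axis=1), corpus, transpose_b=True)
    return tf.math.top_k(sims, k=k)

# Process queries in batches so the similarity matrix never holds all pairs at once.
for start in range(0, num_docs, 256):
    batch = docs[start:start + 256]
    scores, indices = top_k_neighbors(batch, docs, k)
```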

Michael West
- Finally, I was able to solve it with Spark MLlib's Jaccard and MinHash methods. Thanks for your kindness. – tatsuya.takahashi Jan 30 '19 at 13:07
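For future readers, a minimal sketch of the MinHash/Jaccard approach mentioned in the comment above, assuming PySpark; the toy data, column names, number of hash tables, and the 0.6 distance threshold are illustrative choices, not from the original post:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, MinHashLSH

spark = SparkSession.builder.appName("minhash-lsh-sketch").getOrCreate()

# Toy stand-in for the ten million documents.
docs = spark.createDataFrame(
    [(0, ["spark", "memory", "error"]),
     (1, ["spark", "memory", "limit"]),
     (2, ["nearest", "neighbor", "search"])],
    ["id", "words"],
)

# Same 20,000-dimensional term-frequency hashing as in the question.
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=20000)
featurized = hashing_tf.transform(docs)

# Fit MinHash LSH; more hash tables improve recall at extra cost.
mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(featurized)

# Approximate self-join: keep pairs whose Jaccard distance is below 0.6,
# and drop self-matches / duplicate orderings.
pairs = model.approxSimilarityJoin(featurized, featurized, 0.6,
                                   distCol="jaccard_distance")
pairs.filter("datasetA.id < datasetB.id").show(truncate=False)
```

The point of `approxSimilarityJoin` is that it only compares documents that collide in at least one hash table instead of computing all pairwise distances, which is what keeps a self-join over this many records feasible.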