I am in serious trouble. I want to compute pairwise similarities between ten million records, but the job stops because Spark runs out of memory. The ten million documents are hashed via TF hashing into 20,000-dimensional feature vectors. First I tried Spark's approximate similarity join, but the computation did not finish. Next I tried scikit-learn's KNN, but collecting all of the data onto the driver overflowed its memory. Is there any other way to do this?
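For context, here is a minimal sketch of the setup described above, assuming Spark's `HashingTF` in PySpark; the toy data and column names are illustrative, not from the original question:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF

spark = SparkSession.builder.appName("tf-hashing-sketch").getOrCreate()

# Toy stand-in for the ten million documents; a `words` column of tokens is assumed.
docs = spark.createDataFrame(
    [(0, ["spark", "memory", "error"]),
     (1, ["nearest", "neighbor", "search"])],
    ["id", "words"],
)

# Hash each document into a 20,000-dimensional sparse term-frequency vector,
# matching the feature size described in the question.
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=20000)
features = hashing_tf.transform(docs)
features.show(truncate=False)
```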
- Can you clarify your abbreviations? Is TF TensorFlow? Is KNN k-nearest neighbor? – Michael West Dec 22 '18 at 14:40
- Take a look at [Efficient string matching in Apache Spark](https://stackoverflow.com/q/43938672/10465355) – 10465355 Dec 22 '18 at 16:46
- kNN is not part of the native Spark MLlib and ml packages. LinkedIn has published a kNN implementation for Spark: https://github.com/linkedin/scanns – andrew Dec 22 '18 at 20:21
- @MichaelWest TF means term frequency, sorry. KNN is k-nearest neighbor. – tatsuya.takahashi Dec 29 '18 at 10:02
- @user10465355 Thank you for your answer. – tatsuya.takahashi Jan 01 '19 at 07:20
- @andrew Thank you for your kindness. I'm trying it. – tatsuya.takahashi Jan 01 '19 at 07:21
1 Answer
Nearest neighbor search does not seem to be part of Spark's MLlib. The options I can think of are to find a distributed Spark implementation or a TensorFlow implementation.
Are you on Databricks? Recent versions support distributed TensorFlow. I have run larger volumes than yours on a single-node Databricks TensorFlow cluster.
A quick search turned up these:
- tensorflow nearest neighbor
- spark nearest neighbor

Note that I have not tried these myself.
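As a rough illustration of the "tensorflow implementation" suggestion, a brute-force, batched nearest-neighbor lookup in TensorFlow might look like the sketch below; the cosine metric, toy sizes, and function name are assumptions, not part of this answer, and an exact scan would still need batching or an approximate index at the 10M-document scale:

```python
import tensorflow as tf

num_docs, dim, k = 1000, 128, 5          # toy sizes; the real data is 10M x 20,000
docs = tf.math.l2_normalize(tf.random.uniform([num_docs, dim]), axis=1)

def top_k_neighbors(queries, corpus, k):
    """Return cosine similarities and indices of the k nearest corpus rows."""
    sims = tf.matmul(tf.math.l2_normalize(queries, axis=1), corpus, transpose_b=True)
    return tf.math.top_k(sims, k=k)

# Process queries in batches so the similarity matrix never holds all pairs at once.
for start in range(0, num_docs, 256):
    batch = docs[start:start + 256]
    scores, indices = top_k_neighbors(batch, docs, k)
```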

Michael West
- Finally, I was able to solve it with Spark MLlib's Jaccard and MinHash methods. Thanks for your kindness. – tatsuya.takahashi Jan 30 '19 at 13:07
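For future readers, a minimal sketch of the MinHash/Jaccard approach mentioned in the comment above, assuming PySpark; the toy data, column names, number of hash tables, and the 0.6 distance threshold are illustrative choices, not from the original post:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, MinHashLSH

spark = SparkSession.builder.appName("minhash-lsh-sketch").getOrCreate()

# Toy stand-in for the ten million documents.
docs = spark.createDataFrame(
    [(0, ["spark", "memory", "error"]),
     (1, ["spark", "memory", "limit"]),
     (2, ["nearest", "neighbor", "search"])],
    ["id", "words"],
)

# Same 20,000-dimensional term-frequency hashing as in the question.
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=20000)
featurized = hashing_tf.transform(docs)

# Fit MinHash LSH; more hash tables improve recall at extra cost.
mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(featurized)

# Approximate self-join: keep pairs whose Jaccard distance is below 0.6,
# and drop self-matches / duplicate orderings.
pairs = model.approxSimilarityJoin(featurized, featurized, 0.6,
                                   distCol="jaccard_distance")
pairs.filter("datasetA.id < datasetB.id").show(truncate=False)
```

The point of `approxSimilarityJoin` is that it only compares documents that collide in at least one hash table instead of computing all pairwise distances, which is what keeps a self-join over this many records feasible.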