I am in serious trouble. I want to compute the relationships (similarities) among ten million records, but processing stops because Spark runs out of memory. Each of the ten million documents is converted by TF hashing into a 20,000-dimensional feature vector. First I tried an approximate similarity join, but the computation did not finish. Next I tried scikit-learn's KNN, but collecting all the data on the driver overflowed its memory. Is there any other way to do this?

1 Answer

Exact nearest-neighbor search does not seem to be part of Spark's MLlib; as far as I know it only ships LSH-based approximate methods. The options I can think of are to find a distributed Spark implementation or a TensorFlow implementation.

Are you on Databricks? Recent versions support distributed TensorFlow. I have run larger volumes than yours on a single-node Databricks TensorFlow cluster.

A quick search turned up these:

* tensorflow nearest neighbor
* spark nearest neighbor

Note that I have not tried these myself.
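For what it's worth, the idea behind an approximate similarity join can be sketched on a single machine. The snippet below is only an illustration, not Spark's implementation: it uses random-hyperplane signatures (a form of locality-sensitive hashing for cosine similarity) to group documents into buckets, then compares only documents that share a bucket instead of all pairs. The function names, the dict-based sparse vectors, and the parameters `num_bits`, `threshold`, and `seed` are all made up for this example.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    # One random Gaussian hyperplane per signature bit (hypothetical helper).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, planes):
    # vec is a sparse vector stored as {feature_index: value}.
    # Each bit records on which side of a hyperplane the vector falls.
    bits = 0
    for plane in planes:
        dot = sum(value * plane[idx] for idx, value in vec.items())
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def cosine(a, b):
    # Cosine similarity of two sparse vectors; 0.0 if either is all zeros.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(value * b.get(idx, 0.0) for idx, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_pairs(vectors, dim, num_bits=8, threshold=0.8, seed=0):
    # Bucket documents by signature, then verify only within-bucket pairs.
    planes = make_hyperplanes(num_bits, dim, seed)
    buckets = defaultdict(list)
    for doc_id, vec in vectors.items():
        buckets[signature(vec, planes)].append(doc_id)

    result = {}
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pair = tuple(sorted((ids[i], ids[j])))
                sim = cosine(vectors[pair[0]], vectors[pair[1]])
                if sim >= threshold:
                    result[pair] = sim
    return result
```

Calling `similar_pairs(docs, dim=20000)` on a dict of sparse vectors returns a dict mapping document-id pairs to their cosine similarity. More bits give smaller buckets (less work, more missed pairs); in practice several independent hash tables are usually combined to recover recall, which this sketch leaves out.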

Michael West