I am doing fuzzy string matching using MinHashLSH and approxSimilarityJoin on roughly 500 billion pairs. This is too big for my current cluster, so I want to partition the data and run approxSimilarityJoin on each partition iteratively, in batches the cluster can handle.
My current call is:
matched_df = model.stages[-1].approxSimilarityJoin(df1, df2, 1.0, "confidence")
But I am stuck on how to combine repartition, foreachPartition, and approxSimilarityJoin.
I think it should be something like:
df1.repartition(100).foreachPartition(lambda batch: model.stages[-1].approxSimilarityJoin(batch, df2, 1.0, "confidence"))
but this does not work, so I clearly have the wrong syntax. What is the correct way to use foreachPartition here?