I Have 2 large data frames. Each row has lat/lon data. My goal is to do a join between 2 dataframes and find all the points which are within a distance, e.g. 100m.
df1: (id, lat, lon, geohash7)
df2: (id, lat, lon, geohash7)
I want to partition df1 and df2 on geohash7, and then only join within the partitions. I want to avoid joining between partitions to reduce computation.
df1 = df1.repartition(200, "geohash7")
df2 = df2.repartition(200, "geohash7")
df_merged = df1.join(df2, (df1("geohash7")===df2("geohash7")) & (dist(df1("lat"),df1("lon"),df2("lat"),df2("lon"))<100) )
So basically join on geohash7 and then make sure distance between points is less than 100. The problem is that, Spark actually will cross join all the data. How can I make it only do inter-partition join not intra-partition join?