Apache Spark: JOINing RDDs (data sets) using custom criteria/fuzzy matching

Question

Is it possible to join two (Pair)RDDs (or Datasets/DataFrames) (on multiple fields) using some "custom criteria"/fuzzy matching, e.g. range/interval for numbers or dates and various "distance methods", e.g. Levenshtein, for strings?

For "grouping" within an RDD to get a PairRDD, one can implement a PairFunction, but it seems that something similar is not possible when JOINing two RDDs/data sets? I am thinking something like:

rdd1.join(rdd2, myCustomJoinFunction);

I was thinking about implementing the custom logic in hashCode() and equals() but I am not sure how to make "similar" data wind up in the same bucket. I have also been looking into RDD.cogroup() but have not figured out how I could use it to implement this.

I just came across elasticsearc-hadoop. Does anyone know if that library could be used to do something like this?

I am using Apache Spark 2.0.0. I am implementing in Java but an answer in Scala would also be very helpful.

PS. This is my first Stackoverflow question so bear with me if I have made some newbie mistake :).

I have seen [Alternatives to RDD.cartesian for fuzzy join in ApacheSpark](http://stackoverflow.com/questions/33376727/alternatives-to-rdd-cartesian-for-fuzzy-join-in-apachespark) but both my RDDs/data set are going to be too large to `collect()` in memory. — Morten Garbøl Franck, Sep 01 '16 at 20:24

score 0 · Accepted Answer · edited May 23 '17 at 12:32

0

For DataFrames/Datasets you can use join with custom join function. Create an UDF that will be using columns from DataFrame, just like in this question in first answer.

You can also do

rdd1.cartesian(rdd2).filter (...)

Remember that it will consume much time to calculate

edited May 23 '17 at 12:32

Community

1
1

answered Sep 02 '16 at 19:24

T. Gawęda

15,706
4
46
61

Thank you! For my prototype, I am using the first answer from [custom join with non equal keys](http://stackoverflow.com/questions/30132533/custom-join-with-non-equal-keys). – Morten Garbøl Franck Sep 05 '16 at 09:41

Apache Spark: JOINing RDDs (data sets) using custom criteria/fuzzy matching

1 Answers1