Is it possible to join two (Pair)`RDD`s (or `Dataset`s/`DataFrame`s) on multiple fields using some "custom criteria"/fuzzy matching, e.g. a range/interval for numbers or dates and various "distance methods", such as Levenshtein, for strings?
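To make "custom criteria" concrete, here is a made-up example of the kind of predicate I would like the join to use (the `FuzzyMatch` class, the fields and the thresholds are only placeholders; the Levenshtein distance comes from commons-lang3, which I believe is already on Spark's classpath):

```java
import org.apache.commons.lang3.StringUtils;

import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Made-up match criteria: two records match if their names are at most
// 2 edits apart and their dates are at most 3 days apart.
public class FuzzyMatch {
    public static boolean matches(String name1, LocalDate date1,
                                  String name2, LocalDate date2) {
        int nameDistance = StringUtils.getLevenshteinDistance(name1, name2);
        long daysApart = Math.abs(ChronoUnit.DAYS.between(date1, date2));
        return nameDistance <= 2 && daysApart <= 3;
    }
}
```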
For "grouping" within an RDD
to get a PairRDD
, one can implement a PairFunction
, but it seems that something similar is not possible when JOINing two RDD
s/data sets? I am thinking something like:
```java
rdd1.join(rdd2, myCustomJoinFunction);
```
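In the absence of such an API, the only thing I can come up with is a brute-force cartesian product followed by a filter, which obviously will not scale. A rough sketch, assuming a placeholder `Record` bean with `getName()`/`getDate()` and the hypothetical `FuzzyMatch` above:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// Brute-force "fuzzy join": compare every record of rdd1 with every record of rdd2.
// Record (with getName()/getDate()) and FuzzyMatch are placeholders, not real classes.
public static JavaPairRDD<Record, Record> bruteForceFuzzyJoin(JavaRDD<Record> rdd1,
                                                              JavaRDD<Record> rdd2) {
    return rdd1.cartesian(rdd2)          // every (left, right) combination
               .filter(pair -> FuzzyMatch.matches(
                       pair._1().getName(), pair._1().getDate(),
                       pair._2().getName(), pair._2().getDate()));
}
```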
I was thinking about implementing the custom logic in `hashCode()` and `equals()`, but I am not sure how to make "similar" data wind up in the same bucket. I have also been looking into `RDD.cogroup()` but have not figured out how to use it for a real solution; my best attempt is sketched below.
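That attempt "blocks" both sides on a coarse bucket key (here simply the first letter of the name, an arbitrary choice) and compares records only within a bucket, so it misses pairs whose keys land in different buckets:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.List;

// Bucket both RDDs on a coarse key and compare only within a bucket.
// Record and FuzzyMatch are the same placeholders as above.
public static JavaRDD<Tuple2<Record, Record>> blockedFuzzyJoin(JavaRDD<Record> rdd1,
                                                               JavaRDD<Record> rdd2) {
    JavaPairRDD<String, Record> left  = rdd1.keyBy(r -> r.getName().substring(0, 1));
    JavaPairRDD<String, Record> right = rdd2.keyBy(r -> r.getName().substring(0, 1));

    return left.cogroup(right).flatMap(bucket -> {
        List<Tuple2<Record, Record>> out = new ArrayList<>();
        for (Record a : bucket._2()._1()) {      // rdd1 records in this bucket
            for (Record b : bucket._2()._2()) {  // rdd2 records in this bucket
                if (FuzzyMatch.matches(a.getName(), a.getDate(),
                                       b.getName(), b.getDate())) {
                    out.add(new Tuple2<>(a, b));
                }
            }
        }
        return out.iterator();                   // Spark 2.x FlatMapFunction returns an Iterator
    });
}
```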
I just came across elasticsearch-hadoop. Does anyone know whether that library could be used to do something like this?
I am using Apache Spark 2.0.0 and implementing in Java, but an answer in Scala would also be very helpful.
PS: This is my first Stack Overflow question, so bear with me if I have made some newbie mistake :).