I'm using Apache Spark 1.6.0 on CDH. I have an RDD that includes a name column, and I also have a list of my customer names in a separate DataFrame. I need to join these two, but it won't be an exact-match join.
First, I create a DataFrame from my RDD.
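For reference, the conversion looks roughly like this (just a sketch; myRdd stands in for my real RDD of name strings, and a sqlContext is in scope as in the spark-shell):

// Sketch: turn an RDD[String] of names into a single-column DataFrame
import sqlContext.implicits._

val df = myRdd.toDF("someName")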
Schema of the DataFrame (called df) created from the RDD:
root
|-- someName: string (nullable = false)
My second DataFrame (which contains my customer names) is this:
val customerNameDf = Seq(
  (8, "John Smith"),
  (64, "Adam Sandler"),
  (-27, "Matt Jellipop")
).toDF("id", "name")
If I do the following, I find the names (and last names) that match exactly between these two DataFrames.
val newDf = df.join(customerNameDf, df("someName") === customerNameDf("name"))
This is fine and typical for DataFrame joins. However, what I need is a bit different: I cannot use an exact match for the join. Instead, I need to match based on a similarity function.
For example, if the name is "John Smith" in one DataFrame and "J. Smith" in the other, then I need these two records to be considered joined (matched).
So, is there anything that can help here? For instance, instead of using the === operator, is there something like the following that I could use? Note that isSimilar is my own function that I've already implemented.
val newDf = df.join(customerNameDf, isSimilar(df("someName"), customerNameDf("name")))
Or is there some other efficient way to achieve this?
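To make the idea concrete, here is a rough sketch of what I mean, wrapping isSimilar as a UDF so that it produces a Column usable as a join condition (I realise this would effectively be a cartesian product filtered by the UDF, which is part of why I'm asking about efficiency):

import org.apache.spark.sql.functions.udf

// Sketch: isSimilar(a, b) is my own boolean similarity check
val isSimilarUdf = udf((a: String, b: String) => isSimilar(a, b))

val newDf = df.join(customerNameDf, isSimilarUdf(df("someName"), customerNameDf("name")))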
Let me summarise what I actually need to do. I have a CSV file:
John Smith,10500,2017,xyz
John Clarke,3500,2017,abc
J. Smith,600,2017,klm
A second CSV file:
John Smith,100
Adam Sandler,101
I need to get the rows from the first CSV above whose name column is similar (not necessarily an exact match) to a name value in the second CSV. In this case, I need to pick up the following two records:
John Smith,10500,2017,xyz
J. Smith,600,2017,klm
Note that these two names are similar to the following record in the second CSV:
John Smith,100
If a join using RDDs or DataFrames doesn't help here, how can I approach this problem efficiently with Spark (Scala)?
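For completeness, here is a rough sketch of the overall flow I have in mind. Assumptions: the file paths and column names are placeholders, the CSVs have no headers, and Spark's built-in levenshtein function with an arbitrary threshold stands in for a real similarity measure:

import org.apache.spark.sql.functions.levenshtein
import sqlContext.implicits._

// Load both CSVs as DataFrames (paths and column names are placeholders)
val salesDf = sc.textFile("first.csv")
  .map(_.split(","))
  .map(a => (a(0), a(1).toInt, a(2).toInt, a(3)))
  .toDF("name", "amount", "year", "code")

val customerDf = sc.textFile("second.csv")
  .map(_.split(","))
  .map(a => (a(0), a(1).toInt))
  .toDF("name", "id")

// Similarity join: a levenshtein distance below an arbitrary threshold stands in
// for a real similarity measure; this is effectively a filtered cartesian product.
val matched = salesDf
  .join(customerDf, levenshtein(salesDf("name"), customerDf("name")) < 5)
  .select(salesDf("name"), salesDf("amount"), salesDf("year"), salesDf("code"))

Since the second CSV is small, I assume broadcasting it (org.apache.spark.sql.functions.broadcast) would help, but I'd still like to know whether there is a better approach than a filtered cartesian product.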