I'm using Apache Spark 1.6.0 on CDH. I have an RDD that includes a name column, and I also have a list of my customer names in a separate DataFrame. I need to join these two, but it won't be an exact-match join.
First, I create a DataFrame from my RDD.
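For reference, the conversion looks roughly like this (just a sketch; myRdd stands in for my real RDD of name strings, and a sqlContext is in scope as in the spark-shell):

// Sketch: turn an RDD[String] of names into a single-column DataFrame
import sqlContext.implicits._

val df = myRdd.toDF("someName")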
Schema of the DataFrame (called df) created from the RDD:
root
|-- someName: string (nullable = false)
My second DataFrame (which contains my customer names) is this:
val customerNameDf = Seq(
  (8, "John Smith"),
  (64, "Adam Sandler"),
  (-27, "Matt Jellipop")
).toDF("id", "name")
If I do the following, I find the names (and last names) that match exactly between these two DataFrames.
val newDf = df.join(customerNameDf, df("someName") === customerNameDf("name"))
This is fine and typical for DataFrame joins. However, what I need is a bit different: I cannot use an exact match for the join. Instead, I need to match based on a similarity function.
For example, if the name is "John Smith" in one DataFrame and "J. Smith" in the other, then I need these two records to be considered joined (matched).
So, is there anything that can help here? For instance, instead of using the === operator, is there something like the following that I could use? Note that isSimilar is my own function that I've already implemented.
val newDf = df.join(customerNameDf, isSimilar(df("someName"), customerNameDf("name")))
Or is there some other efficient way to achieve this?
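To make the idea concrete, here is a rough sketch of what I mean, wrapping isSimilar as a UDF so that it produces a Column usable as a join condition (I realise this would effectively be a cartesian product filtered by the UDF, which is part of why I'm asking about efficiency):

import org.apache.spark.sql.functions.udf

// Sketch: isSimilar(a, b) is my own boolean similarity check
val isSimilarUdf = udf((a: String, b: String) => isSimilar(a, b))

val newDf = df.join(customerNameDf, isSimilarUdf(df("someName"), customerNameDf("name")))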
Let me summarise what I actually need to do. I have a CSV file:
John Smith,10500,2017,xyz
John Clarke,3500,2017,abc
J. Smith,600,2017,klm
A second CSV file:
John Smith,100
Adam Sandler,101
I need to get the rows from the first CSV above whose name column is similar (not necessarily an exact match) to a name value in the second CSV. In this case, I need to pick up the following two records:
John Smith,10500,2017,xyz
J. Smith,600,2017,klm
Note that these two names are similar to the following record in the second CSV:
John Smith,100
If a join using RDDs or DataFrames doesn't help here, how can I approach this problem efficiently with Spark (Scala)?
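For completeness, here is a rough sketch of the overall flow I have in mind. Assumptions: the file paths and column names are placeholders, the CSVs have no headers, and Spark's built-in levenshtein function with an arbitrary threshold stands in for a real similarity measure:

import org.apache.spark.sql.functions.levenshtein
import sqlContext.implicits._

// Load both CSVs as DataFrames (paths and column names are placeholders)
val salesDf = sc.textFile("first.csv")
  .map(_.split(","))
  .map(a => (a(0), a(1).toInt, a(2).toInt, a(3)))
  .toDF("name", "amount", "year", "code")

val customerDf = sc.textFile("second.csv")
  .map(_.split(","))
  .map(a => (a(0), a(1).toInt))
  .toDF("name", "id")

// Similarity join: a levenshtein distance below an arbitrary threshold stands in
// for a real similarity measure; this is effectively a filtered cartesian product.
val matched = salesDf
  .join(customerDf, levenshtein(salesDf("name"), customerDf("name")) < 5)
  .select(salesDf("name"), salesDf("amount"), salesDf("year"), salesDf("code"))

Since the second CSV is small, I assume broadcasting it (org.apache.spark.sql.functions.broadcast) would help, but I'd still like to know whether there is a better approach than a filtered cartesian product.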