Spark Join of 2 dataframes which have 2 different column names in list

Question

Is there a way to join two Spark Dataframes with different column names via 2 lists?

I know that if they had the same names in a list I could do the following:

val joindf = df1.join(df2, Seq("col_a", "col_b"), "left")

or if I knew the different column names I could do this:

df1.join(
df2, 
df1("col_a") <=> df2("col_x")
    && df1("col_b") <=> df2("col_y"),
"left"
)

Since my method is expecting inputs of 2 lists which specify which columns are to be used for the join for each DF, I was wondering if Scala Spark had a way of doing this?

P.S I'm looking for something like a python pandas merge:

joindf = pd.merge(df1, df2, left_on = list1, right_on = list2, how = 'left')

score 2 · Answer 1 · answered May 07 '18 at 19:21

You can easely define such a method yourself:

 def merge(left: DataFrame, right: DataFrame, left_on: Seq[String], right_on: Seq[String], how: String) = {
      import org.apache.spark.sql.functions.lit
      val joinExpr = left_on.zip(right_on).foldLeft(lit(true)) { case (acc, (lkey, rkey)) => acc and (left(lkey) === right(rkey)) }
      left.join(right, joinExpr, how)
    }


val df1 = Seq((1, "a")).toDF("id1", "n1")
val df2 = Seq((1, "a")).toDF("id2", "n2")

val joindf = merge(df1, df2, left_on = Seq("id1", "n1"), right_on = Seq("id2", "n2"), how = "left")

Alper t. Turker · Accepted Answer · 2018-05-07T20:59:30.157

1

If you expect two lists of strings:

val leftOn = Seq("col_a", "col_b")
val rightOn = Seq("col_x", "coly")

Just zip and reduce:

import org.apache.spark.sql.functions.col

val on = leftOn.zip(rightOn)
  .map { case (x, y) => df1(x) <=> df2(y) }
  .reduce(_ and _)

df1.join(df2, on, "left")

edited May 07 '18 at 20:59

answered May 07 '18 at 19:15

Alper t. Turker

34,230
9
83
115

Ambiguity when both sequences have value(s) -- one or more -- that are the same. – Zzrot May 07 '18 at 20:57
In that case you can reference tables directly. – Alper t. Turker May 07 '18 at 20:59
Still having certain issues with duplicate columns being generated after joining 'left'. Any idea how I should deal with that? – Zzrot May 10 '18 at 13:44
That's normal. If columns from the other table are obsolete, you can drop these later. – Alper t. Turker May 10 '18 at 13:56
In order to drop, I would need to know the column names, but if both are the same, then I would have to rename each of the duplicate ones to something unique and then drop. Is there a better way of doing this? Eg. adding a "suffix" to the columns that are duplicated while joining? – Zzrot May 10 '18 at 14:09
It is Python but available methods are the same: https://stackoverflow.com/q/33778664/9613318 – Alper t. Turker May 10 '18 at 14:32

Spark Join of 2 dataframes which have 2 different column names in list

2 Answers2