0

How to calculate the join of two Dataframes using multiple columns as key? For example DF1 , DF2 are the two dataFrame.

This is the way by which we can calculate the join,

JoinDF = DF1.join(DF2, DF1("column1") === DF2("column11") && DF1("column2") === DF2("column22"), "outer") 

But my problem is how to access the multiple columns if they are stored in an arrays like :

DF1KeyArray=Array{column1,column2}
DF2KeyArray=Array{column11,column22}

then It is not possible to calculate the join by this method

JoinDF = DF1.join(DF2, DF1(DF1KeyArray)=== DF2(DF2KeyArray), "outer")

In this case error was :

<console>:128: error: type mismatch;
found   : Array[String]
required: String

Is there any way to access multiple columns as keys stored in an Array for calculation of join?

zero323
  • 322,348
  • 103
  • 959
  • 935
N.Mittal
  • 19
  • 1
  • 7

1 Answers1

10

You can simply create joinExprs programmatically:

val df1KeyArray: Array[String] = ???
val df2KeyArray: Array[String] = ???

val df1: DataFrame = ???
val df2: DataFrame = ???

val joinExprs = df1KeyArray
  .zip(df2KeyArray)
  .map{case (c1, c2) => df1(c1) === df2(c2)}
  .reduce(_ && _)

df1.join(df2, joinExprs, "outer")

See also Including null values in an Apache Spark Join

Community
  • 1
  • 1
zero323
  • 322,348
  • 103
  • 959
  • 935
  • 1
    nice use of `.reduce(_ && _)` – Zahiro Mor Feb 02 '16 at 12:21
  • Best way to join using multiple columns. @zero323 Please add that if someone wants to return true (rather than NULL) if both inputs are NULL, then === must be replaced with <=> (EqualNullSafe) – Panayotis May 08 '17 at 16:31