With the following data:
val df1 = sc.parallelize(Seq(1->2, 2->3, 3->7, 5->4, 1->3)).toDF("col1", "col2")
val df2 = sc.parallelize(Seq(1->3, 5->1)).toDF("col1", "col2")
Then you can join your datasets with a or condition:
val cols = df1.columns
df1.join(df2, cols.map(c => df1(c) === df2(c)).reduce(_ || _) )
.select(cols.map(df1(_)) :_*)
.distinct
.show
+----+----+
|col1|col2|
+----+----+
| 2| 3|
| 1| 2|
| 1| 3|
| 5| 4|
+----+----+
The join condition is generic and would work for any number of columns. The code maps each column to an equality between that column in df1 and the same one in df2 cols.map(c => df1(c) === df2(c))
. The the reduce takes the logical or of all these equalities, which is what you want.
The select is there because otherwise the columns of both dataframes would be kept. Here I simply keep the ones from df1. I also added a distinct in case several lines of df2 would match a line of df1 or vice versa. Indeed, you may get a cartesian product.
Note that this method does not need any collection to the driver so it will work regardless of the size of the datasets. Yet, if df2 is small enough to be collected to the driver and braodcasted, you would get faster results with a method like this:
// to each column name, we map the set of values in df2.
val valueMap = df2.rdd
.flatMap(row => cols.map(name => name -> row.getAs[Any](name)))
.distinct
.groupByKey
.mapValues(_.toSet)
.collectAsMap
//we create a udf that looks up in valueMap
val filter = udf((name : String, value : Any) =>
valueMap(name).contains(value))
//Finally we apply the filter.
df1.where( cols.map(c => filter(lit(c), df1(c))).reduce(_||_))
.show
With this method, no shuffling of df1 and no cartesian product. If df2 is small, this is definitely the way to go.