How to solve the crossjoin self join output using spark dsl

Question

import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions._

object crossjoin {
  def main(args:Array[String]):Unit= {

    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("SparkByExamples.com")
      .getOrCreate()
      var sparkConf: SparkConf = null

 sparkConf = new SparkConf().set("spark.sql.crossJoin.enabled", "true")
      spark.sparkContext.setLogLevel("ERROR")
  import spark.implicits._
 
 
  val df1 = List("IN","PK", "AU","SL").toDF("country")
  df1.show()

//df1.withColumn("combinations", //collect_set("country").over(Window.orderBy()))
//.show(false)
  
}
}

Input:
+-------+
|country|
+-------+
|     IN|
|     PK|
|     AU|
|     SL|
+-------+

output

+--------+
|  result|
+--------+
|AU vs SL|
|AU vs PK|
|AU vs IN|
|IN vs PK|
+--------+

The result should not contain duplicates. Like something cross join should be performed i think. i tried but am unable to solve it. i got this sql query.

select concat(c1.country,'vs',c2.country) as result from country c1 
left join country c2 on c1.country!=c2.coutry 
where c1.country!='PK' and c2.country!='IN' and (c1.country!='SL' or c2.country='PK') 
order by result

By duplicate you mean `SL vs AU` is duplicate of `AU vs SL`? Why not `IN vs SL` and `SL vs PK` in output? — June7, Aug 23 '21 at 02:09

score 1 · Accepted Answer · answered Aug 23 '21 at 07:02

You can use Window functionality to to get desired result.

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

import spark.implicits._
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

List("IN", "PK", "AU", "SL").toDF("country")
// Combinations column will have array[country], countries will be picked up from next rows
   .withColumn("combinations", collect_set("country").
      over(Window.rowsBetween(Window.currentRow + 1, Window.unboundedFollowing)))
    // Last row will have empty array, filter that
   .where(size('combinations) > 0)
    // Concat each element of array column with country
   .withColumn("combinations",
      expr("transform(combinations, c-> concat_ws(' vs ', country, c))"))
    // Explode array to get each element of array in rows.
   .select(explode('combinations))
   .show(false)
/*
+--------+
|col     |
+--------+
|IN vs SL|
|IN vs AU|
|IN vs PK|
|PK vs SL|
|PK vs AU|
|AU vs SL|
+--------+*/

How to solve the crossjoin self join output using spark dsl

1 Answers1