
I would like to test user input against a whitelist of the join types available in Spark.

Is there a way to get the list of join types from a Spark built-in?

For instance, I would like to validate the user's input against this Seq: Seq("inner", "cross", "outer", "full", "fullouter", "left", "leftouter", "right", "rightouter", "leftsemi", "leftanti")

(which are all the join types available in Spark) without hardcoding it, as I have just done.

BlueSheepToken

2 Answers


I adapted the answer from this question here. You could also keep the join types in a JSON file and read it at runtime; see this answer for JSON object handling: JsonParsing.
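
For instance, a minimal sketch of loading the whitelist from a JSON file at runtime (the file name joinTypes.json and its line-delimited layout are assumptions, not from the original answer):

import org.apache.spark.sql.SparkSession

object JoinTypesFromJson extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()

  // Assumed layout: one JSON object per line, e.g. {"joinType": "inner"}
  val joinTypes: Seq[String] = spark.read
    .json("joinTypes.json")
    .select("joinType")
    .collect()
    .map(_.getString(0))
    .toSeq

  println(joinTypes.mkString(", "))
}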

Update 1: I updated the answer to follow the way Spark itself handles this, see JoinType.

import org.apache.spark.sql.SparkSession


object SparkSandbox extends App {

  case class Row(id: Int, value: String)

  private[this] implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()

  import spark.implicits._

  spark.sparkContext.setLogLevel("ERROR")

  val r1 = Seq(Row(1, "A1"), Row(2, "A2"), Row(3, "A3"), Row(4, "A4")).toDS()
  val r2 = Seq(Row(3, "A3"), Row(4, "A4"), Row(4, "A4_1"), Row(5, "A5"), Row(6, "A6")).toDS()
  val validUserJoinType = "inner"
  val invalidUserJoinType = "nothing"

  // Join types exercised in the demo loop below ("cross" is omitted because it takes no join columns)
  val joinTypes = Seq("inner", "outer", "full", "full_outer", "left", "left_outer", "right", "right_outer", "left_semi", "left_anti")

  // Every string accepted by Dataset.join, mirroring JoinType.apply in the Spark source
  val supported = Seq(
    "inner",
    "outer", "full", "fullouter", "full_outer",
    "leftouter", "left", "left_outer",
    "rightouter", "right", "right_outer",
    "leftsemi", "left_semi",
    "leftanti", "left_anti",
    "cross")

  invalidUserJoinType match {
    case x if supported.contains(x) =>
      println("do some logic")
      joinTypes.foreach { joinType =>
        println(s"${joinType.toUpperCase()} JOIN")
        r1.join(right = r2, usingColumns = Seq("id"), joinType = joinType).orderBy("id").show()
      }
    case x =>
      throw new IllegalArgumentException(s"Unsupported join type '$x'. " +
        "Supported join types include: " + supported.mkString("'", "', '", "'") + ".")
  }

}
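
With the valid input ("inner") the match prints each join type and shows the joined result ordered by id; with the invalid input ("nothing") it throws an IllegalArgumentException listing the supported join types, mirroring Spark's own error message.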
Moustafa Mahmoud
  • Thank you very much for your answer, but I am actually looking for a built-in function; the join types are still hardcoded, just not in the code anymore but in a JSON file. Do you know if there is a built-in for this? – BlueSheepToken Jan 04 '19 at 12:52
  • But if you check the Spark source, it is hardcoded; they didn't add something like a case class or types: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala – Moustafa Mahmoud Jan 04 '19 at 13:01
  • https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala Apparently you are right, it is hardcoded in the apply method :( – BlueSheepToken Jan 04 '19 at 13:20
  • I updated the answer to follow the same approach Spark itself uses. – Moustafa Mahmoud Jan 04 '19 at 13:26

Sorry, this is not possible without a PR to the Spark project itself. The join types are defined inline in JoinType. There are classes that extend JoinType, but their naming convention differs from the strings used in the case statement, so you're out of luck, I'm afraid.

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala
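
That said, since JoinType.apply is where unsupported strings are rejected, one could delegate validation to Spark itself instead of copying the list. A minimal sketch, assuming you are willing to depend on Catalyst internals (not a stable public API):

import org.apache.spark.sql.catalyst.plans.JoinType

import scala.util.Try

object JoinTypeValidation {
  // Delegate validation to Spark's own parser: JoinType.apply throws an
  // IllegalArgumentException for any unsupported string, and its error
  // message already lists every supported join type.
  def isValidJoinType(userInput: String): Boolean = Try(JoinType(userInput)).isSuccess
}

For example, isValidJoinType("inner") returns true and isValidJoinType("nothing") returns false.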

David