-1

I am using spark-sql-2.4.1v how to do various joins depend on the value of column

Sample data

val data = List(
  ("20", "score", "school",  14 ,12),
  ("21", "score", "school",  13 , 13),
  ("22", "rate", "school",  11 ,14)
 )
val df = data.toDF("id", "code", "entity", "value1","value2")

+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 20|score|school|    14|    12|
| 21|score|school|    13|    13|
| 22| rate|school|    11|    14|
| 21| rate|school|    13|    12|

based the "code" column value i need to do join with various other tables

val rateDs = // val data1= List(
  ("22", 11 ,A),
  ("22", 14 ,B),
  ("20", 13 ,C),
  ("21", 12 ,C),
  ("21", 13 ,D)
)

val df = data1.toDF("id", "map_code","map_val")

val scoreDs = // scoreTable 

if the "code" column value is "rate" i need to join with rateDs if the "code" column value is "score" i need to join with scoreDs

how to handle these kind of things in spark ? any optimum way to achieve this?

Expected result for "rate" fields

+---+-----+------+------+------+
| id| code|entity|value1|value2|
+---+-----+------+------+------+
| 22| rate|school|     A|    B |
| 21| rate|school|     D|    C |
BdEngineer
  • 2,929
  • 4
  • 49
  • 85

1 Answers1

1

You can simply join twice, for example

val data = List(
  ("20", "score", "school",  14 , 12),
  ("21", "score", "school",  13 , 13),
  ("22", "rate", "school",  11 , 14),
  ("21", "rate", "school",  13 , 12)    
 )
val df = data.toDF("id", "code", "entity", "value1","value2")

val data1 = List(
  ("22", 11 ,"A"),
  ("22", 14 ,"B"),
  ("20", 13 ,"C"),
  ("21", 12 ,"C"),
  ("21", 13 ,"D")
)
val rateDF = data1.toDF("id", "map_code","map_val")

df.as("a")
  .join(rateDF.as("b"),
       col("a.code") === lit("rate") 
        && col("a.id") === col("b.id") 
        && col("a.value1") === col("b.map_code"), "inner")
  .join(rateDF.as("c"),
       col("a.code") === lit("rate") 
        && col("a.id") === col("c.id") 
        && col("a.value2") === col("c.map_code"), "inner")
  .select(col("a.id"), col("a.code"), col("a.entity"), col("b.map_val").as("value1"), col("c.map_val").as("value2"))
  .show(false)

+---+----+------+------+------+
|id |code|entity|value1|value2|
+---+----+------+------+------+
|22 |rate|school|A     |B     |
|21 |rate|school|D     |C     |
+---+----+------+------+------+

Well, this looks a bit dirty, but I have no idea for the multiple columns...

Lamanus
  • 12,898
  • 4
  • 21
  • 47
  • thanks , it is possible to do using "when" clause ? this joins , i guess , would be performance hit. – BdEngineer Jul 28 '20 at 12:44
  • 1
    That is not recommended when you join the tables. – Lamanus Jul 28 '20 at 12:44
  • what is being done here coalesce("b.value1", "c.value1") ?? – BdEngineer Jul 28 '20 at 13:05
  • 1
    When `code = rate`, b.value1 will not be null, c.value1 will be null and when `code = score`, it will be reversed. So, coalesce collect those two results as a column but it is up to you, that is just an example. – Lamanus Jul 28 '20 at 13:09
  • after joining based on col("a.id") === col("b.id"), "left") condition , I need to loop up multiple column values i.e. "value_1" & "value_2" from "df" in "rateDs" ....how can that be handled ? – BdEngineer Jul 28 '20 at 13:11
  • You can add more condition by adding `&&`. For example when you need to join entity column too, then `col("a.code") === lit("rate") && col("a.id") === col("b.id") && col("a.entity") === col("b.entity"), "left"`. – Lamanus Jul 28 '20 at 13:13
  • updated the question , can you check once , included sample rateDs data and Expected result for "rate" – BdEngineer Jul 28 '20 at 13:35
  • thanks a lot , i have another use case like this , any advice please https://stackoverflow.com/questions/63137437/doing-multiple-column-value-look-up-after-joining-with-lookup-dataset – BdEngineer Jul 28 '20 at 15:16