0

I need to compare two tables (i.e data frames) in spark data frame, row by row, and get each row with the smaller value for a specific column. For example:

enter image description here

Let's say I want to get each row with the lower scored subject for each student, so therefore I want this result:

enter image description here

I was thinking of joining both data frame first with id as join attribute, but my original tables are large and have more attributes. It seems that this is doable without join. The closest problem I can found is this but I don't know get how to apply that to my case.

Btw solutions with join are also appreciable, I am just thinking if there can be a better solution.

zero323
  • 322,348
  • 103
  • 959
  • 935
Algorithman
  • 1,309
  • 1
  • 16
  • 39

1 Answers1

2

You cannot do it without join, unless you can guarantee that each both have the same number of partitions and records in each partition. Then you can convert to RDD and zip. Otherwise just join:

import org.apache.spark.sql.functions.{least, struct}

val df1 = Seq(
  (345, "math", 70), (992, "chem", 76), (223, "bio", 80)
).toDF("id", "subject", "score")

val df2 = Seq(
  (345, "psy", 64), (992, "ant", 94), (223, "math",   45)
).toDF("id", "subject", "score")

df1.alias("df1")
  .join(df2.alias("df2"), Seq("id"))
  .select($"id", 
    least(struct($"df1.score", $"df1.subject"),
          struct($"df2.score", $"df2.subject")).alias("score"))
  .select($"id", $"score.subject", $"score.score")

// +---+-------+-----+
// | id|subject|score|
// +---+-------+-----+
// |345|    psy|   64|
// |992|   chem|   76|
// |223|   math|   45|
// +---+-------+-----+

or

import org.apache.spark.sql.functions.when

df1.alias("df1")
  .join(df2.alias("df2"), Seq("id"))
  .select(
    $"id",
    when($"df1.score" < $"df2.score", $"df1.subject").otherwise($"df2.subject").alias("subject"),
     least($"df1.score", $"df2.score").alias("score"))

// +---+-------+-----+
// | id|subject|score|
// +---+-------+-----+
// |345|    psy|   64|
// |992|   chem|   76|
// |223|   math|   45|
// +---+-------+-----+
Alper t. Turker
  • 34,230
  • 9
  • 83
  • 115