Compare two data frames with the same schema row by row

Question

I need to compare two tables (i.e data frames) in spark data frame, row by row, and get each row with the smaller value for a specific column. For example:

Let's say I want to get each row with the lower scored subject for each student, so therefore I want this result:

I was thinking of joining both data frame first with id as join attribute, but my original tables are large and have more attributes. It seems that this is doable without join. The closest problem I can found is this but I don't know get how to apply that to my case.

Btw solutions with join are also appreciable, I am just thinking if there can be a better solution.

using join will be fine too, but even with join, I still have no idea on how to get to the result I want — Algorithman, Dec 06 '17 at 18:02

Alper t. Turker · Accepted Answer · 2017-12-06T18:12:28.793

You cannot do it without join, unless you can guarantee that each both have the same number of partitions and records in each partition. Then you can convert to RDD and zip. Otherwise just join:

import org.apache.spark.sql.functions.{least, struct}

val df1 = Seq(
  (345, "math", 70), (992, "chem", 76), (223, "bio", 80)
).toDF("id", "subject", "score")

val df2 = Seq(
  (345, "psy", 64), (992, "ant", 94), (223, "math",   45)
).toDF("id", "subject", "score")

df1.alias("df1")
  .join(df2.alias("df2"), Seq("id"))
  .select($"id", 
    least(struct($"df1.score", $"df1.subject"),
          struct($"df2.score", $"df2.subject")).alias("score"))
  .select($"id", $"score.subject", $"score.score")

// +---+-------+-----+
// | id|subject|score|
// +---+-------+-----+
// |345|    psy|   64|
// |992|   chem|   76|
// |223|   math|   45|
// +---+-------+-----+

or

import org.apache.spark.sql.functions.when

df1.alias("df1")
  .join(df2.alias("df2"), Seq("id"))
  .select(
    $"id",
    when($"df1.score" < $"df2.score", $"df1.subject").otherwise($"df2.subject").alias("subject"),
     least($"df1.score", $"df2.score").alias("score"))

// +---+-------+-----+
// | id|subject|score|
// +---+-------+-----+
// |345|    psy|   64|
// |992|   chem|   76|
// |223|   math|   45|
// +---+-------+-----+

Compare two data frames with the same schema row by row

1 Answers1