I have two data frames that holds values for some people in two different time stamps. The possible changes for a person before and after are listed in the code below.
val before = Seq(
(1, "soccer", "1", "2", "3", "4", ""),
(2, "soccer", "", "", "", "", ""),
(3, "soccer", "1", "", "", "", ""),
(4, "soccer", "1", "", "", "", ""),
(5, "soccer", "1", "", "", "", ""),
(6, "soccer", "1", "", "", "", "")
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")
before.show //> +---+------+----+----+----+----+----+
//| | id| sport|var1|var2|var3|var4|var5|
//| +---+------+----+----+----+----+----+
//| | 1|soccer| 1| 2| 3| 4| |
//| | 2|soccer| | | | | |
//| | 3|soccer| 1| | | | |
//| | 4|soccer| 1| | | | |
//| | 5|soccer| 1| | | | |
//| | 6|soccer| 1| | | | |
//| +---+------+----+----+----+----+----+
//|
val after = Seq(
(1, "soccer", "1", "2", "3", "4", ""), // Same
(2, "soccer", "1", "", "", "", ""), // Addition
(3, "soccer", "1", "1", "", "", ""), // Addition
(4, "soccer", "", "", "", "", ""), // Remove
(5, "soccer", "2", "1", "", "", ""), // Slide
(6, "soccer", "2", "", "", "", "") // Change
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")
after.show //> +---+------+----+----+----+----+----+
//| | id| sport|var1|var2|var3|var4|var5|
//| +---+------+----+----+----+----+----+
//| | 1|soccer| 1| 2| 3| 4| |
//| | 2|soccer| 1| | | | |
//| | 3|soccer| 1| 1| | | |
//| | 4|soccer| | | | | |
//| | 5|soccer| 2| 1| | | |
//| | 6|soccer| 2| | | | |
//| +---+------+----+----+----+----+----+
//|
So things can stay the same, there could be an addition or a removal, and finally there could be a change or a slide.
My ideal output is something that confronts each row in before and after data frames and attach a label:
outcome.show //> +---+------+------+
//| | id| sport| diff|
//| +---+------+------+
//| | 1|soccer| same|
//| | 2|soccer| add|
//| | 3|soccer| add|
//| | 4|soccer|remove|
//| | 5|soccer| slide|
//| | 6|soccer|change|
//| +---+------+------+
//|
This question is related to this one, but the point there was just to count how many differences there was between two rows... This time I am trying to understand these differences with a finer grain, but I am stuck in defining the different possible options.
EDIT
Since I am using DataFrame, I'd like to stick to this structure rather than use case classes. I am thus trying to adapt what has been proposed by @iboss using DataFrame instead.
I have this UDF that should do all the work:
val diff = udf { (bef:DataFrame, aft:DataFrame) => {
"hello" // return just this string for now
} : String
}
This udf will do all the work, as suggested by @iboss, to produce the output in outcome.show, so the possible outcome after matching two rows will be a String, more precisely one of "same", "add", "remove", "slide" or "change".
I have then this code to merge the two data frames and create the new column:
val mydiff = before.join(after, "id")
.withColumn("diff", diff( before, after ) )
.select("id", "diff")
However, I have an error when calling diff that complains like this:
type mismatch; found : org.apache.spark.sql.DataFrame required: org.apache.spark.sql.Column
What I don't understand is why it doesn't like the DataFrame and how to solve it...