I have the following two identically structured dataframes, with the id column in common.
val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000))
.toDF("id","name","city","credit_score","credit_limit")
scala> originalDF.show(false)
+---+------+---------+------------+------------+
|id |name  |city     |credit_score|credit_limit|
+---+------+---------+------------+------------+
|1  |gaurav|jaipur   |550         |70000       |
|2  |sunil |noida    |600         |80000       |
|3  |rishi |ahmedabad|510         |65000       |
+---+------+---------+------------+------------+
val changedDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000))
.toDF("id","name","city","credit_score","credit_limit")
scala> changedDF.show(false)
+---+------+------+------------+------------+
|id |name  |city  |credit_score|credit_limit|
+---+------+------+------------+------------+
|1  |gaurav|jaipur|550         |70000       |
|2  |sunil |noida |650         |90000       |
|4  |Joshua|cochin|612         |85000       |
+---+------+------+------------+------------+
I wrote a UDF to calculate the change in column values.
val diff = udf((col: String, c1: String, c2: String) => if (c1 == c2) "" else col )
val somedf = changedDF.alias("a")
  .join(originalDF.alias("b"), col("a.id") === col("b.id"))
  .withColumn("diffcolumn", split(concat_ws(",", changedDF.columns.map(x => diff(lit(x), changedDF(x), originalDF(x))): _*), ","))
scala> somedf.show(false)
+---+------+------+------------+------------+---+------+------+------------+------------+----------------------------------+
|id |name  |city  |credit_score|credit_limit|id |name  |city  |credit_score|credit_limit|diffcolumn                        |
+---+------+------+------------+------------+---+------+------+------------+------------+----------------------------------+
|1  |gaurav|jaipur|550         |70000       |1  |gaurav|jaipur|550         |70000       |[, , , , ]                        |
|2  |sunil |noida |650         |90000       |2  |sunil |noida |600         |80000       |[, , , credit_score, credit_limit]|
+---+------+------+------------+------------+---+------+------+------------+------------+----------------------------------+
But I'm not able to get id and diffcolumn separately. If I do somedf.select('id), it gives me an ambiguity error because there are two id columns in the joined table. For each id whose values have changed, I want to get the names of the changed columns in an array. For example, in changedDF the credit score and credit limit of id=2, name=sunil have changed. So I want the resulting dataframe to look like this:
+---+----------------------------------+
|id |diffcolumn                        |
+---+----------------------------------+
|2  |[, , , credit_score, credit_limit]|
+---+----------------------------------+
Can anyone suggest an approach to get the id and the changed columns separately in a dataframe?
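One possible direction, as a minimal sketch rather than a definitive answer: refer to the id through the join alias (col("a.id")) so the selection is no longer ambiguous, and build the list of changed column names with built-in functions instead of a UDF. The sketch assumes a local SparkSession, the same sample data as above, and a null-safe comparison (<=>); the intermediate column name "changed" is made up for illustration. Note that it drops the empty placeholders, so the array holds only the changed names rather than [, , , credit_score, credit_limit].

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("diff-columns").getOrCreate()
import spark.implicits._

val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000))
  .toDF("id","name","city","credit_score","credit_limit")
val changedDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000))
  .toDF("id","name","city","credit_score","credit_limit")

// Compare every column except the join key.
val compareCols = changedDF.columns.filter(_ != "id")

// when(...) without otherwise yields null for unchanged columns,
// and concat_ws skips nulls, so only the changed names are joined.
val changedNames = concat_ws(",",
  compareCols.map(c => when(not(col(s"a.$c") <=> col(s"b.$c")), lit(c))): _*)

val result = changedDF.alias("a")
  .join(originalDF.alias("b"), col("a.id") === col("b.id"))
  .select(col("a.id").as("id"), changedNames.as("changed")) // alias-qualified id avoids the ambiguity
  .filter(col("changed") =!= "")                            // keep only ids with at least one change
  .select(col("id"), split(col("changed"), ",").as("diffcolumn"))

result.show(false)
```

Because this is an inner join, ids present in only one dataframe (3 and 4 here) do not appear in the result; a full outer join would be needed to report those as well.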