I have two data frames and I am doing outer join. So when I do outer join it leaves me with so many duplicate rows with same header.
Here is my df1:
+----------+------+---------+--------+------+
| OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877341| 136| 4| 1| I|!||
|4295877346| 136| 4| 1| I|!||
|4295877341| 138| 2| 1| I|!||
|4295877341| 141| 4| 1| I|!||
|4295877341| 143| 2| 1| I|!||
|4295877341| 145| 14| 1| I|!||
| 123456789| 145| 14| 1| I|!||
+----------+------+---------+--------+------+
Here is my df2:
+----------+------+---------+--------+------+
| OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877343| 149| 15| 2| I|!||
|4295877341| 136| null| null| I|!||
| 123456789| 145| 14| 1| D|!||
|4295877341| 138| 11| 22| I|!||
|4295877341| 141| 10| 1| I|!||
|4295877341| 143| 1| 1| I|!||
+----------+------+---------+--------+------+
Now I am joining both data frame in order to get below expected output.
+----------+------+---------+--------+------+
| OrgId|ItemId|segmentId|Sequence|Action|
+----------+------+---------+--------+------+
|4295877346| 136| 4| 1| I|!||
|4295877341| 145| 14| 1| I|!||
|4295877343| 149| 15| 2| I|!||
|4295877341| 136| null| null| I|!||
|4295877341| 138| 11| 22| I|!||
|4295877341| 141| 10| 1| I|!||
|4295877341| 143| 1| 1| I|!||
+----------+------+---------+--------+------+
Please note that here in the output I update data frame 1 with data frame 2 records for the matching records and new records in df2 is also getting in the output and if any records in df1 not matching with df2 that also will be in the output data frame.
To do this here is what I have tried.
Firstly I have tried this.
val df3 = df1.join(latestForEachKey, Seq("OrgId", "ItemId"), "outer")
then i got below output ..
+----------+------+---------+--------+------+---------+--------+------+
| OrgId|ItemId|segmentId|Sequence|Action|segmentId|Sequence|Action|
+----------+------+---------+--------+------+---------+--------+------+
|4295877346| 136| 4| 1| I|!|| null| null| null|
|4295877341| 145| 14| 1| I|!|| null| null| null|
|4295877343| 149| null| null| null| 15| 2| I|!||
|4295877341| 136| 4| 1| I|!|| null| null| I|!||
| 123456789| 145| 14| 1| I|!|| 14| 1| D|!||
|4295877341| 138| 2| 1| I|!|| 11| 22| I|!||
|4295877341| 141| 4| 1| I|!|| 10| 1| I|!||
|4295877341| 143| 2| 1| I|!|| 1| 1| I|!||
+----------+------+---------+--------+------+---------+--------+------+
Then i tried with renaming the df2 column renamed and executed below code .
val df3 = df1.join(latestForEachKey, Seq("OrgId", "ItemId"), "outer")
.select($"OrgId", $"ItemId",
when($"segmentId_1".isNotNull, $"segmentId_1").otherwise($"segmentId").as("segmentId"),
when($"Sequence_1".isNotNull, $"Sequence_1").otherwise($"Sequence").as("Sequence"),
when($"Action_1".isNotNull, $"Action_1").otherwise($"Action").as("Action")).filter(!$"Action".contains("D"))
Now this has given me the correct output but I have to write so may times when condition.
I have origin data set with 60 columns so then it might the problem.
Please suggest some way out.