You can use the spark-extension package mentioned in this answer Compare two dataframes Pyspark
You can modify the result
dataframe to get the final dataframe appropriately.
from gresearch.spark.diff import *
sc = SparkContext('local')
sqlContext = SQLContext(sc)
data1 = [
("Source1",20201116, 436039, 123, 222, 333,0, 555),
("Source1",20201117,436034, 234, 34, 7, 5, 678)
]
df1Columns = ["Source","Snapshot_Date","REC_COUNT","Col1","Col2","Col3","Col4","Col5"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
print("Source1 dataframe")
df1.show(truncate=False)
data2 = [
("Source2", 20201116,436039,234,234,333,0,555),
("Source2", 20201117,436034,234,5,7,5,678)
]
df2Columns = ["Source", "Snapshot_Date","REC_COUNT","Col1","Col2","Col3","Col4","Col5"]
df2 = sqlContext.createDataFrame(data=data2, schema = df2Columns)
print(" Source2 dataframe")
df2.show(truncate=False)
options = DiffOptions().with_change_column("changes")
result = df1.diff_with_options(df2, options, 'Snapshot_Date')
result.select("Snapshot_Date", "diff","changes").show(truncate=False)
The output is as shown. The column changes
lists down the columns in which changes are present.
Source1 dataframe
+-------+-------------+---------+----+----+----+----+----+
|Source |Snapshot_Date|REC_COUNT|Col1|Col2|Col3|Col4|Col5|
+-------+-------------+---------+----+----+----+----+----+
|Source1|20201116 |436039 |123 |222 |333 |0 |555 |
|Source1|20201117 |436034 |234 |34 |7 |5 |678 |
+-------+-------------+---------+----+----+----+----+----+
Source2 dataframe
+-------+-------------+---------+----+----+----+----+----+
|Source |Snapshot_Date|REC_COUNT|Col1|Col2|Col3|Col4|Col5|
+-------+-------------+---------+----+----+----+----+----+
|Source2|20201116 |436039 |234 |234 |333 |0 |555 |
|Source2|20201117 |436034 |234 |5 |7 |5 |678 |
+-------+-------------+---------+----+----+----+----+----+
+-------------+----+--------------------+
|Snapshot_Date|diff|changes |
+-------------+----+--------------------+
|20201116 |C |[Source, Col1, Col2]|
|20201117 |C |[Source, Col2] |
+-------------+----+--------------------+