0

I am having two spark data frames like below

Source1 data

Source2 data

I am using pyspark python to compare the data between the two sources using the Snapshot_Date as Key column and want to display the result in another dataframe like below

Compare color coding is for easy understanding and not needed

Thanks in Advance

1 Answers1

0

You can use the spark-extension package mentioned in this answer Compare two dataframes Pyspark

You can modify the result dataframe to get the final dataframe appropriately.

from gresearch.spark.diff import *

sc = SparkContext('local')
sqlContext = SQLContext(sc)

data1 = [
        ("Source1",20201116, 436039, 123, 222, 333,0, 555),
        ("Source1",20201117,436034, 234, 34, 7, 5, 678)
      ]

df1Columns = ["Source","Snapshot_Date","REC_COUNT","Col1","Col2","Col3","Col4","Col5"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)

print("Source1 dataframe")
df1.show(truncate=False)

data2 = [
        ("Source2", 20201116,436039,234,234,333,0,555),
        ("Source2", 20201117,436034,234,5,7,5,678)
      ]

df2Columns = ["Source", "Snapshot_Date","REC_COUNT","Col1","Col2","Col3","Col4","Col5"]
df2 = sqlContext.createDataFrame(data=data2, schema = df2Columns)

print(" Source2 dataframe")
df2.show(truncate=False)

options = DiffOptions().with_change_column("changes")
result = df1.diff_with_options(df2, options, 'Snapshot_Date')

result.select("Snapshot_Date", "diff","changes").show(truncate=False)

The output is as shown. The column changes lists down the columns in which changes are present.

Source1 dataframe
+-------+-------------+---------+----+----+----+----+----+
|Source |Snapshot_Date|REC_COUNT|Col1|Col2|Col3|Col4|Col5|
+-------+-------------+---------+----+----+----+----+----+
|Source1|20201116     |436039   |123 |222 |333 |0   |555 |
|Source1|20201117     |436034   |234 |34  |7   |5   |678 |
+-------+-------------+---------+----+----+----+----+----+

 Source2 dataframe
+-------+-------------+---------+----+----+----+----+----+
|Source |Snapshot_Date|REC_COUNT|Col1|Col2|Col3|Col4|Col5|
+-------+-------------+---------+----+----+----+----+----+
|Source2|20201116     |436039   |234 |234 |333 |0   |555 |
|Source2|20201117     |436034   |234 |5   |7   |5   |678 |
+-------+-------------+---------+----+----+----+----+----+

+-------------+----+--------------------+
|Snapshot_Date|diff|changes             |
+-------------+----+--------------------+
|20201116     |C   |[Source, Col1, Col2]|
|20201117     |C   |[Source, Col2]      |
+-------------+----+--------------------+
user238607
  • 1,580
  • 3
  • 13
  • 18