I have two dataframes lets say dfA and dfB.
dfA:
IdCol | Col2 | Col3
id1 | val2 | val3
dfB:
IdCol | Col2 | Col3
id1 | val2 | val4
The two data frames join in IdCol. I want to compare them per row and keep columns different and the their values in another dataframe. For example from the two above dataframes I want a the result:
dfChanges:
RowId | Col | dfA_value | dfB_value |
id1 | Col3 | val_3 | val_4 |
I am kinda stuck on how to do this. Can anyone provide a direction? Thanks in advance
EDIT
My try is this. But its not very clear or has good performance. Is there a better way to do it?
dfChanges = None
#for all column excpet id
for colName in dfA.column[1:]:
#Select whole columns of id and targeted column
#from both datasets and subtract to find differences
changedRows = dfA.select(['IdCol',colName]).subtract(dfB.select(['IdCol',colName]))
#Join with dfB to take the value of targeted column from there
temp = changedRows.join(dfB.select(col('IdCol'),col(colName).alias("dfB_value")),dfA.IdCol == dfB.IdCol, 'inner'). \
drop(dfB.IdCol)
#Proper Rename columns
temp = temp.withColumnRenamed(colname,"dfA_value")
temp = temp.withColumn("Col",lit(colName))
#Append to a single dataframe
if (dfChanges is None):
dfChanges = temp
else:
dfChanges = dfChanges.union(temp)