Let's say I have two PySpark DataFrames, T1 and T2, with a similar structure. Some columns exist in one but not in the other.
T1
ID | balance | interest | T1_non_repeated_1 | T1_non_repeated_2
T2
ID | balance | interest | T2_non_repeated_1 | T2_non_repeated_2 | T2_not_repeated_3
I'd like to create a table with T2's IDs as the base (i.e. a left join from T2) that contains, for every column common to both tables, the average of the T1 and T2 values where the IDs match.
My thought so far in PySpark (pseudocode) is:
T2.join(T1, on="ID", how="left").withColumn("balance", (T1.balance + T2.balance) / 2).withColumn("interest", (T1.interest + T2.interest) / 2)...
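Written out more explicitly (assuming ID is the join key, and aliasing the two frames so the shared column names can be disambiguated), I believe it would look something like this:

```python
from pyspark.sql import functions as F

t1 = T1.alias("t1")
t2 = T2.alias("t2")

# Left join so every T2 ID is kept, even when there is no T1 match
joined = t2.join(t1, on="ID", how="left")

result = joined.select(
    F.col("ID"),
    ((F.col("t1.balance") + F.col("t2.balance")) / 2).alias("balance"),
    ((F.col("t1.interest") + F.col("t2.interest")) / 2).alias("interest"),
    # ...one averaged expression per common column, plus the T2-only columns
)
```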
My questions are:
This becomes a very lengthy command in PySpark if there are, say, 100 common columns between the two tables. Is there a way to write it differently and generate the expression dynamically for all 100 common columns? (I've put a rough sketch of what I mean below.)
Other suggestions are welcome.
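Here is the rough sketch of the dynamic approach I have in mind, assuming the join key is ID and every shared column other than ID is numeric:

```python
from pyspark.sql import functions as F

t1_cols = set(T1.columns)
common_cols = [c for c in T2.columns if c in t1_cols and c != "ID"]   # columns to average
t2_only_cols = [c for c in T2.columns if c not in t1_cols]            # carried through as-is

t1 = T1.alias("t1")
t2 = T2.alias("t2")

# Build one averaged expression per common column
avg_exprs = [
    ((F.col(f"t1.{c}") + F.col(f"t2.{c}")) / 2).alias(c)
    for c in common_cols
]
passthrough_exprs = [F.col(f"t2.{c}") for c in t2_only_cols]

result = (
    t2.join(t1, on="ID", how="left")          # keep every T2 ID
      .select(F.col("ID"), *avg_exprs, *passthrough_exprs)
)
```

One thing I'm unsure about: for T2 IDs with no match in T1, the left join gives nulls on the T1 side, so the averaged columns come out null. I could wrap the T1 side in F.coalesce(F.col(f"t1.{c}"), F.col(f"t2.{c}")) to fall back to T2's value, if that is the right behaviour.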
Thank you