
Let's say I have two PySpark dataframes, T1 and T2, with a similar structure. Some columns exist in one and not in the other.

T1
ID | balance | interest | T1_non_repeated_1 | T1_non_repeated_2

T2
ID | balance | interest | T2_non_repeated_1 | T2_non_repeated_2 | T2_not_repeated_3

I'd like to create a table that contains the average of the common columns where the IDs match, using T2's IDs as the base.

My thought so far in PySpark (pseudocode) is:

T2.join(T1, on="ID", how="left").withColumn("balance", (T1["balance"] + T2["balance"]) / 2).withColumn("interest", (T1["interest"] + T2["interest"]) / 2)...

My questions are:

  1. This becomes a lengthy command in PySpark if I have, say, 100 common columns in both tables. Is there a way to write it differently and generate the command dynamically for all 100 common columns? (I've sketched below roughly what I mean.)

  2. Other suggestions are welcome.
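To show what I mean by "dynamically", here is a rough sketch of the kind of loop I'm imagining. I don't know whether this is the right or most efficient approach, and the `ID` join key and the `t1_` prefix are just my own placeholders:

```python
from pyspark.sql import functions as F

# Columns present in both tables, excluding the join key.
common_cols = [c for c in T1.columns if c in T2.columns and c != "ID"]

# Rename T1's common columns so they don't clash after the join.
T1_renamed = T1.select("ID", *[F.col(c).alias(f"t1_{c}") for c in common_cols])

# Keep all T2 IDs as the base.
joined = T2.join(T1_renamed, on="ID", how="left")

# Overwrite each common column with the average of the two versions.
for c in common_cols:
    joined = joined.withColumn(c, (F.col(c) + F.col(f"t1_{c}")) / 2)

result = joined.drop(*[f"t1_{c}" for c in common_cols])
```

One thing I'm unsure about: with a left join, T2 IDs that have no match in T1 produce nulls, so the averaged columns come out null for those rows instead of keeping the T2 value.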

Thank you

Kenny
    You might join them into some dataframe first, then use a `for` loop to do the averaging. – Ala Tarighati Feb 24 '20 at 16:18
  • Can you add an example of input and the expected output? See [How to make good reproducible Apache Spark examples](https://stackoverflow.com/q/48427185/1386551) – blackbishop Feb 24 '20 at 18:20

0 Answers