
Let's say I have two PySpark dataframes, T1 and T2, with a similar structure. Some columns exist in one and not in the other.

T1
ID | balance | interest | T1_non_repeated_1 | T1_non_repeated_2

T2
ID | balance | interest | T2_non_repeated_1 | T2_non_repeated_2 | T2_not_repeated_3

I'd like to create a table that contains the average of the common columns where the IDs match, using T2's IDs as the base.

My thought so far in PySpark (pseudocode) is:

T2.join(T1, on="ID", how="left").withColumn("balance", (T1["balance"] + T2["balance"]) / 2).withColumn("interest", (T1["interest"] + T2["interest"]) / 2)...

My questions are:

  1. This becomes a lengthy command in PySpark if I have, say, 100 common columns in both tables. Is there a way to write it differently and generate the command dynamically for all 100 common columns? (I've sketched below roughly what I mean.)

  2. Other suggestions are welcome.
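To show what I mean by "dynamically", here is a rough sketch of the kind of loop I'm imagining. I don't know whether this is the right or most efficient approach, and the `ID` join key and the `t1_` prefix are just my own placeholders:

```python
from pyspark.sql import functions as F

# Columns present in both tables, excluding the join key.
common_cols = [c for c in T1.columns if c in T2.columns and c != "ID"]

# Rename T1's common columns so they don't clash after the join.
T1_renamed = T1.select("ID", *[F.col(c).alias(f"t1_{c}") for c in common_cols])

# Keep all T2 IDs as the base.
joined = T2.join(T1_renamed, on="ID", how="left")

# Overwrite each common column with the average of the two versions.
for c in common_cols:
    joined = joined.withColumn(c, (F.col(c) + F.col(f"t1_{c}")) / 2)

result = joined.drop(*[f"t1_{c}" for c in common_cols])
```

One thing I'm unsure about: with a left join, T2 IDs that have no match in T1 produce nulls, so the averaged columns come out null for those rows instead of keeping the T2 value.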

Thank you

Kenny
    You might join them into some dataframe first, then use a `for` loop to do the averaging. – Ala Tarighati Feb 24 '20 at 16:18
  • Can you add an example of input and the expected output? See [How to make good reproducible Apache Spark examples](https://stackoverflow.com/q/48427185/1386551) – blackbishop Feb 24 '20 at 18:20

0 Answers