I have an input dataframe as below:
partner_id|month_id|value1 |value2
1001 | 01 |10 |20
1002 | 01 |20 |30
1003 | 01 |30 |40
1001 | 02 |40 |50
1002 | 02 |50 |60
1003 | 02 |60 |70
1001 | 03 |70 |80
1002 | 03 |80 |90
1003 | 03 |90 |100
Using the code below, I have created two new columns that compute the cumulative average with a window function:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# cumulative frame per partner, from the first month up to the current row
rnum = Window.partitionBy("partner_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn("value1_1", F.avg("value1").over(rnum))
df = df.withColumn("value2_2", F.avg("value2").over(rnum))
Output:
partner_id|month_id|value1 |value2|value1_1|value2_2
1001 | 01 |10 |20 |10 |20
1002 | 01 |20 |30 |20 |30
1003 | 01 |30 |40 |30 |40
1001 | 02 |40 |50 |25 |35
1002 | 02 |50 |60 |35 |45
1003 | 02 |60 |70 |45 |55
1001 | 03 |70 |80 |40 |50
1002 | 03 |80 |90 |50 |60
1003 | 03 |90 |100 |60 |70
The cumulative average works as expected on the value1 and value2 columns using the PySpark window function. However, if one month of data is missing from the input, the average for the following months should be based on the month number rather than on the count of available rows. For example, if the input looks like the below (month 02 data is missing):
partner_id|month_id|value1 |value2
1001 | 01 |10 |20
1002 | 01 |20 |30
1003 | 01 |30 |40
1001 | 03 |70 |80
1002 | 03 |80 |90
1003 | 03 |90 |100
Then the average for the month 03 records is calculated as, for example, (10 + 70) / 2. What is the correct way to compute the average when certain months are missing?
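One possible approach (not from the original post; just a minimal sketch that assumes month_id is a zero-padded month number equal to the number of months elapsed since the first month) is to divide a cumulative sum by the month number instead of using F.avg, so that a missing month still counts toward the divisor:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# running total per partner up to and including the current month
w = Window.partitionBy("partner_id").orderBy("month_id").rangeBetween(Window.unboundedPreceding, 0)

# divide the cumulative sum by the month number so missing months still count
# toward the divisor; month_id is assumed to be "01", "02", ... (months elapsed)
df = (df
      .withColumn("value1_1", F.sum("value1").over(w) / F.col("month_id").cast("int"))
      .withColumn("value2_2", F.sum("value2").over(w) / F.col("month_id").cast("int")))

With the example above, partner 1001 in month 03 would then get (10 + 70) / 3 ≈ 26.67 instead of (10 + 70) / 2 = 40.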