I have a PySpark dataframe:
Location  Month     Brand   Sector  TrueValue   PickoutValue
USA       1/1/2021  brand1  cars1   7418        30000
USA       2/1/2021  brand1  cars1   1940        2000
USA       3/1/2021  brand1  cars1   4692        2900
USA       4/1/2021  brand1  cars1
USA       1/1/2021  brand2  cars2   16383104.2  16666667
USA       2/1/2021  brand2  cars2   26812874.2  16666667
USA       3/1/2021  brand2  cars2
USA       1/1/2021  brand3  cars3   75.6%       70.0%
USA       3/1/2021  brand3  cars3   73.1%       70.0%
USA       2/1/2021  brand3  cars3   77.1%       70.0%
I have Month values from 1/1/2021 to 12/1/2021 for each Brand. I need to create another column with the cumulative sum of the TrueValue column, partitioned by Brand and Sector and ordered by Month. For the rows with % values, the new column should instead hold the cumulative sum divided by the number of months so far (a running average).
My expected dataframe is:
Location  Month     Brand   Sector  TrueValue   PickoutValue  TotalSumValue
USA       1/1/2021  brand1  cars1   7418        30000         7418
USA       2/1/2021  brand1  cars1   1940        2000          9358
USA       3/1/2021  brand1  cars1   4692        2900          14050
USA       4/1/2021  brand1  cars1                             14050
USA       1/1/2021  brand2  cars2   16383104.2  16666667      16383104.2
USA       2/1/2021  brand2  cars2   26812874.2  16666667      43195978.4
USA       3/1/2021  brand2  cars2                             43195978.4
USA       1/1/2021  brand3  cars3   75.6%       70.0%         75.6%
USA       3/1/2021  brand3  cars3   73.1%       70.0%         75.3%
USA       2/1/2021  brand3  cars3   77.1%       70.0%         76.3%
For the rows with % values, this is how the cumulative value has to be calculated, ordering by Month:
75.6/1 = 75.6% (month 1)
(75.6 + 77.1)/2 = 76.3% (months 1 to 2)
(75.6 + 77.1 + 73.1)/3 = 75.3% (months 1 to 3)
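In other words, the value for the nth % row is the mean of the first n % values in Month order. A quick plain-Python sanity check of the arithmetic above (the list is just the brand3 TrueValue figures sorted by Month):

vals = [75.6, 77.1, 73.1]  # brand3 values for 1/1, 2/1, 3/1
running = [sum(vals[:n]) / n for n in range(1, len(vals) + 1)]
print(running)  # [75.6, 76.35, 75.266...] -> 75.6%, 76.3%, 75.3% at one decimal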
I'm able to generate the cumulative sum for the numeric rows, but I'm not getting the running average for the % rows.
This is my code block:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# parse Month (month/day/year) so the window sorts chronologically
df = df.withColumn('month_in_timestamp', F.to_timestamp(df.Month, 'M/d/yyyy'))
windowval = (Window.partitionBy('Brand', 'Sector').orderBy('month_in_timestamp')
             .rangeBetween(Window.unboundedPreceding, 0))
df = df.withColumn('TotalSumValue', F.sum('TrueValue').over(windowval))
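What I'm considering to handle the % rows (a minimal sketch, assuming TrueValue is a string column and that a '%' suffix marks the percentage rows; is_pct and num_val are helper columns I'm introducing, not part of the data):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowval = (Window.partitionBy('Brand', 'Sector')
             .orderBy('month_in_timestamp')
             .rowsBetween(Window.unboundedPreceding, 0))

# flag % rows and strip the '%' sign so every value is numeric
df = (df
      .withColumn('is_pct', F.col('TrueValue').endswith('%'))
      .withColumn('num_val', F.regexp_replace('TrueValue', '%', '').cast('double')))

# running average for % rows, cumulative sum otherwise; both branches are
# cast to string so the mixed results fit in one column, like TrueValue itself
df = df.withColumn(
    'TotalSumValue',
    F.when(F.col('is_pct'),
           F.format_string('%.1f%%', F.avg('num_val').over(windowval)))
     .otherwise(F.sum('num_val').over(windowval).cast('string')))

The one-decimal formatting is an assumption to match the sample output; floating-point values right at the .x5 boundary (like 76.35) may round either way.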