I have a dataframe with the below data and columns:
sales_df.select('sales', 'monthly_sales').show()
+----------------+--------------------------+ | sales | monthly_sales | +----------------+--------------------------+ | mid| 50.0| | low| 21.0| | low| 25.0| | high| 70.0| | mid| 60.0| | high| 75.0| | high| 95.0| |................|..........................| |................|..........................| |................|..........................| | low| 25.0| | low| 20.0| +----------------+--------------------------+
I am trying to find the average of each sales type into a dataframe where I only have three rows(one for each sales type) in my final dataframe.
sale & average_sale
I used groupBy
to achieve this.
sales_df.groupBy("sales").avg("monthly_sales").alias('average_sales').show()
and I was able to get the average sale as well.
+----------------+-------------------------------+ | sales | average sales | +----------------+-------------------------------+ | mid| 5.568177828054298| | high| 1.361184210526316| | low| 3.014350758853288| +----------------+-------------------------------+
This ran faster because I am running my logic on test data which has 200 rows. So the code ran in no time. But I have huge data in my actual application and then there is the problem of data shuffle due to groupBy.
Is there any better way to find out the average without using groupBy
?
Could anyone let me know the efficient way to achieve the solution considering huge data size.