
I have a dataframe with the below data and columns:

sales_df.select('sales', 'monthly_sales').show()
+----------------+--------------------------+
|      sales     |        monthly_sales     |
+----------------+--------------------------+
|             mid|                      50.0|
|             low|                      21.0|
|             low|                      25.0|
|            high|                      70.0|
|             mid|                      60.0|
|            high|                      75.0|
|            high|                      95.0|
|................|..........................|
|................|..........................|
|................|..........................|    
|             low|                      25.0|
|             low|                      20.0|
+----------------+--------------------------+

I am trying to find the average of monthly_sales for each sales type, so that my final dataframe has only three rows (one per sales type) with two columns: sales and average_sales.

I used groupBy to achieve this.

from pyspark.sql.functions import avg

sales_df.groupBy("sales").agg(avg("monthly_sales").alias("average_sales")).show()

and I was able to get the averages:

+----------------+-------------------------------+
|      sales     |          average_sales        |
+----------------+-------------------------------+
|             mid|              5.568177828054298|
|            high|              1.361184210526316|
|             low|              3.014350758853288|
+----------------+-------------------------------+

This ran fast because I am running my logic on test data with only 200 rows, so the code finished almost instantly. But my actual application has huge data, and groupBy then causes an expensive data shuffle.

Is there a better way to compute the averages without using groupBy? Could anyone suggest an efficient way to achieve this, considering the huge data size?

  • You can probably use RDD-based functions like `map` and `reduceByKey`, where you would treat the records more directly, in a more MapReduce-like way (a rough sketch follows these comments). However, DataFrames are by far more optimized for these kinds of processes, because if you end up using RDDs you have to be _extremely_ cautious with the transformations and actions you apply to avoid strangling the data flow. I also doubt it will make any significant difference in terms of efficiency, but that's really a matter of Spark proficiency. – Coursal May 01 '21 at 16:59
  • `group by` on a DataFrame is already as smart as possible; see also https://stackoverflow.com/a/32903568/4744359 – UninformedUser May 01 '21 at 18:09
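
For reference, a minimal sketch of the map/reduceByKey approach mentioned in the first comment could look like the following, assuming the sales_df shown in the question; as that comment notes, it is unlikely to outperform the DataFrame groupBy.

# Rough sketch only: build (key, (value, 1)) pairs, sum values and counts per key, then divide.
pair_rdd = sales_df.rdd.map(lambda row: (row["sales"], (row["monthly_sales"], 1)))
sums_counts = pair_rdd.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
averages = sums_counts.mapValues(lambda v: v[0] / v[1])
averages.collect()   # e.g. [('mid', ...), ('high', ...), ('low', ...)]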

1 Answer


groupBy is exactly what you're looking for. Spark is designed to handle big data (of virtually any size, really), so what you should do is configure your Spark application properly: give it the right amount of memory, increase the number of cores, use more executors, improve parallelism, and so on.
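
As a rough illustration of that kind of tuning (the property names are standard Spark settings, but the values below are placeholders rather than recommendations), a SparkSession might be configured along these lines:

from pyspark.sql import SparkSession

# Illustrative values only; the right numbers depend on your cluster and data volume.
spark = (
    SparkSession.builder
    .appName("sales-averages")
    .config("spark.executor.memory", "8g")          # memory per executor
    .config("spark.executor.cores", "4")            # cores per executor
    .config("spark.executor.instances", "10")       # number of executors
    .config("spark.sql.shuffle.partitions", "400")  # parallelism of the groupBy shuffle
    .getOrCreate()
)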

– pltc