
I am new to PySpark, so I wanted to know: is there a better way to group by multiple columns one by one instead of looping over all of them? Currently I loop over every required group-by column, but it takes a very long time. I have around 50-60 columns, and for each one I need to group by it and apply the same aggregations on a fixed set of columns. Current code using a loop:

from pyspark.sql.functions import mean, count

for name in req_string_columns:
    tmp = (Selected_data.groupBy(name)
           .agg(mean("ABC"), mean("XYZ"), count("ABC"), count("XYZ"))
           .withColumnRenamed(name, 'Category'))

Is there any better way to do it?

asked by ASD
  • so, your expected output is like mean of ABC for groups present in col1, then separately for col2, and so on? – samkart Nov 07 '22 at 08:15
  • Yes that is correct – ASD Nov 07 '22 at 08:27
  • It is a duplicate. You will find a similar question and answer for this, e.g. https://stackoverflow.com/questions/33882894/spark-sql-apply-aggregate-functions-to-a-list-of-columns – Ramdev Sharma Nov 07 '22 at 09:35
  • Thanks @RamdevSharma, but that applies aggregations to multiple columns while the group-by column stays the same. My question is the other way around: the same aggregate functions applied to a list of group-by columns, one by one. Something like this: df.groupBy("col1").agg(mean("ABC")), then df.groupBy("col2").agg(mean("ABC")), and so on for 50 to 60 columns – ASD Nov 07 '22 at 14:44
  • There is no other way. Each groupBy is a shuffle; you cannot combine them. You may find a way if you know the same column set applies to specific aggregations. – Ramdev Sharma Nov 07 '22 at 14:58
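
To make the pattern from the comments concrete, here is a minimal sketch of the same per-column loop, where each result is tagged with the name of the grouping column and the pieces are unioned into a single DataFrame. It reuses the Selected_data, req_string_columns, ABC, and XYZ names from the question; the GroupedBy column and the reduce-based union are illustrative choices, and each groupBy still triggers its own shuffle, as noted in the comment above.

    from functools import reduce
    from pyspark.sql.functions import mean, count, lit

    # One aggregation per grouping column (sketch; each groupBy is still a separate shuffle)
    results = []
    for name in req_string_columns:
        agg_df = (Selected_data.groupBy(name)
                  .agg(mean("ABC"), mean("XYZ"), count("ABC"), count("XYZ"))
                  .withColumnRenamed(name, 'Category')
                  .withColumn('GroupedBy', lit(name)))  # remember which column produced these rows
        results.append(agg_df)

    # Combine the per-column results into one DataFrame (all pieces share the same schema)
    combined = reduce(lambda a, b: a.unionByName(b), results)

This does not reduce the number of shuffles, but it keeps all per-column results in one place instead of overwriting tmp on every iteration.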

0 Answers