1

I'm looking to groupBy agg on the below Spark dataframe and get the mean, max, and min of each of the col1, col2, col3 columns

sp = spark.createDataFrame([['a',2,4,5], ['a',4,7,7], ['b',6,0,9], ['b', 2, 4, 4], ['c', 4, 4, 9]], ['id', 'col1', 'col2','col3'])

+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
|  a|   2|   4|   5|
|  a|   4|   7|   7|
|  b|   6|   0|   9|
|  b|   2|   4|   4|
|  c|   4|   4|   9|
+---+----+----+----+

I've tried sp.groupBy('id').agg({'*':'max'}) to even just get the max on all of it but running into an error.

I've tried sp.groupBy('id').agg({'col1':['max', 'min', 'mean']}) but this is more of a traditional Pandas way to do it but it doesn't work.

id  max(col1)  max(col2)  max(col3)  min(col1) min(col2) min(col3) mean(col1) ..
a   4          7          7          2         4         5         3   
b   6          4          9          2         0         4         4  
c   4          4          9          4         4         9         4  
Matt W.
  • 3,692
  • 2
  • 23
  • 46
  • 1
    Possible duplicate of [Spark SQL: apply aggregate functions to a list of columns](https://stackoverflow.com/questions/33882894/spark-sql-apply-aggregate-functions-to-a-list-of-columns) and [Multiple Aggregate operations on the same column of a spark dataframe](https://stackoverflow.com/questions/34954771/multiple-aggregate-operations-on-the-same-column-of-a-spark-dataframe) – pault Jun 20 '19 at 19:13
  • 6
    Full details in the duplicates, but you want to do: `from pyspark.sql.functions import max as max_` and then `sp.groupBy('id').agg(*[max_(c) for c in sp.columns[1:]])` - you can expand this to also include the `mean` and `min`. – pault Jun 20 '19 at 19:15

1 Answers1

2

Try this:

%%pyspark
SP_agg = sp.groupBy(
    sp.id.alias('identity')
    ).agg(
        sum("col1").alias("Annual_col1"), 
        sum("col2").alias("Annual_col2"), 
        sum("col3").alias("Annual_col3"), 
        mean("col1").alias("mean_col1"), 
        mean("col2").alias("mean_col2"), 
        mean("col3").alias("mean_col3"), 
        min("col1").alias("min_col1"), 
        min("col2").alias("min_col2"), 
        min("col3").alias("min_col3"), 
        max("col1").alias("max_col1"), 
        max("col2").alias("max_col2"), 
        max("col3").alias("max_col3") 
        )
SP_agg.show(10)
Loku
  • 201
  • 2
  • 5