
To group a Spark DataFrame with PySpark, I use a command like this:

df2 = df.groupBy('_c1','_c3').agg({'_c4':'max', '_c2' : 'avg'})

As a result I get output like this:

+-----------------+-------------+------------------+--------+                   
|              _c1|          _c3|          avg(_c2)|max(_c4)|
+-----------------+-------------+------------------+--------+
|        Local-gov|      HS-grad| 644952.5714285715|       9|
|        Local-gov|   Assoc-acdm|365081.64285714284|      12|
|     Never-worked| Some-college|          462294.0|      10|
|        Local-gov|    Bachelors|         398296.35|      13|
|      Federal-gov|      HS-grad|          493293.0|       9|
|          Private|         12th| 632520.5454545454|       8|
|        State-gov|    Assoc-voc|          412814.0|      11|
|                ?|      HS-grad| 545870.9230769231|       9|
|          Private|  Prof-school|340322.89130434784|      15|
+-----------------+-------------+------------------+--------+

This is nice, but there are two things I'm missing:

  1. I would like to have control over the column names. For example, I want the new column to be named avg_c2 instead of avg(_c2).
  2. I want to aggregate the same column in different ways. For example, I might want to know the minimum and maximum of column _c4. I tried the following and it does not work:

    df2 = df.groupBy('_c1','_c3').agg({'_c4':('min','max'), '_c2' : 'avg'})

Is there a way to achieve what I need?

Roman

1 Answer


You can use the withColumn API to generate new columns or replace the old ones.
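
For example, keeping the dictionary-style agg you already have, something like this should work in PySpark (a sketch; here withColumnRenamed is enough, since the aggregated values already exist and only need new names):

    # keep the dictionary-style agg, then rename the auto-generated columns
    df2 = (df.groupBy('_c1', '_c3')
             .agg({'_c4': 'max', '_c2': 'avg'})
             .withColumnRenamed('avg(_c2)', 'avg_c2')
             .withColumnRenamed('max(_c4)', 'max_c4'))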

Or you can use alias to get the column name you want instead of the default avg(_c2).

I haven't used PySpark yet, but in Scala I do something like:

import org.apache.spark.sql.functions._

// assign to a val and give each aggregation an explicit alias
val df2 = df.groupBy("_c1", "_c3").agg(
  max(col("_c4")).alias("max_c4"),
  min(col("_c4")).alias("min_c4"),
  avg(col("_c2")).alias("avg_c2"))
Ramesh Maharjan
  • I get a `SyntaxError`. The complaint is about the colon `:` in `('_c4':'max')`. Are you sure about the syntax? What you have in your example is neither a tuple nor a dictionary. – Roman Jun 10 '17 at 11:36
  • I haven't coded in pyspark yet, so I don't know the correct syntax, but I have updated my answer with Scala code. You can try `df2 = df.groupBy('_c1','_c3').agg({'_c4':'max' as 'max_c4', '_c4':'min' as 'min_c4', '_c2' : 'avg' as 'avg_c2'})` – Ramesh Maharjan Jun 10 '17 at 11:44
  • Now I get `NameError: name 'col' is not defined`. – Roman Jun 10 '17 at 11:50
  • `col` is defined in `org.apache.spark.sql.functions`, which you have to `import`. That's in Scala; I am not familiar with how you do it in pyspark. – Ramesh Maharjan Jun 10 '17 at 11:54
  • I use `pyspark`. So, I guess I need to `from pyspark.sql.functions import col`. If I do that I get `TypeError: Column is not iterable`. – Roman Jun 10 '17 at 12:52
  • @RameshMaharjan, now your example works. Thank you. Actually I do not know what the problem was; I cannot reproduce the mistake. Maybe it was some typo, or I did not import something. Now it works in the form you have provided. Thank you! – Roman Jun 10 '17 at 13:35
  • which one worked? python or scala? – Ramesh Maharjan Jun 10 '17 at 14:18