
To group a Spark DataFrame with PySpark, I use a command like this:

df2 = df.groupBy('_c1','_c3').agg({'_c4':'max', '_c2' : 'avg'})

As a result I get output like this:

+-----------------+-------------+------------------+--------+                   
|              _c1|          _c3|          avg(_c2)|max(_c4)|
+-----------------+-------------+------------------+--------+
|        Local-gov|      HS-grad| 644952.5714285715|       9|
|        Local-gov|   Assoc-acdm|365081.64285714284|      12|
|     Never-worked| Some-college|          462294.0|      10|
|        Local-gov|    Bachelors|         398296.35|      13|
|      Federal-gov|      HS-grad|          493293.0|       9|
|          Private|         12th| 632520.5454545454|       8|
|        State-gov|    Assoc-voc|          412814.0|      11|
|                ?|      HS-grad| 545870.9230769231|       9|
|          Private|  Prof-school|340322.89130434784|      15|
+-----------------+-------------+------------------+--------+

This is nice, but there are two things I'm missing:

  1. I would like to have control over the column names. For example, I want the new column to be named avg_c2 instead of avg(_c2).
  2. I want to aggregate the same column in different ways. For example, I might want to know the minimum and maximum of column _c4. I tried the following and it does not work:

    df2 = df.groupBy('_c1','_c3').agg({'_c4':('min','max'), '_c2' : 'avg'})

Is there a way to achieve what I need?

Roman

1 Answer


You can use the withColumn API to generate new columns or replace the old ones.
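
For example, keeping the dictionary-style agg you already have, something like this should work in PySpark (a sketch; here withColumnRenamed is enough, since the aggregated values already exist and only need new names):

    # keep the dictionary-style agg, then rename the auto-generated columns
    df2 = (df.groupBy('_c1', '_c3')
             .agg({'_c4': 'max', '_c2': 'avg'})
             .withColumnRenamed('avg(_c2)', 'avg_c2')
             .withColumnRenamed('max(_c4)', 'max_c4'))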

Or you can use alias to get the column name you want instead of the default avg(_c2).

I haven't used PySpark yet, but in Scala I do something like:

import org.apache.spark.sql.functions._

// assign to a val and give each aggregation an explicit alias
val df2 = df.groupBy("_c1", "_c3").agg(
  max(col("_c4")).alias("max_c4"),
  min(col("_c4")).alias("min_c4"),
  avg(col("_c2")).alias("avg_c2"))
Ramesh Maharjan
  • I get a `SyntaxError`. The complaint is about the colon `:` in `('_c4':'max')`. Are you sure about the syntax? What you have in your example is neither a tuple nor a dictionary. – Roman Jun 10 '17 at 11:36
  • I haven't coded in pyspark yet, so I don't know the correct syntax, but I have updated my answer with Scala code. You can try `df2 = df.groupBy('_c1','_c3').agg({'_c4':'max' as 'max_c4', '_c4':'min' as 'min_c4', '_c2' : 'avg' as 'avg_c2'})` – Ramesh Maharjan Jun 10 '17 at 11:44
  • Now I get `NameError: name 'col' is not defined`. – Roman Jun 10 '17 at 11:50
  • `col` is defined in `org.apache.spark.sql.functions`, which you have to `import`. That's in Scala; I am not familiar with how you do it in pyspark. – Ramesh Maharjan Jun 10 '17 at 11:54
  • I use `pyspark`. So, I guess I need to `from pyspark.sql.functions import col`. If I do that I get `TypeError: Column is not iterable`. – Roman Jun 10 '17 at 12:52
  • @RameshMaharjan, now your example works. Thank you. Actually I do not know what the problem was; I cannot reproduce the mistake. Maybe it was some typo, or I did not import something. Now it works in the form you have provided. Thank you! – Roman Jun 10 '17 at 13:35
  • which one worked? python or scala? – Ramesh Maharjan Jun 10 '17 at 14:18