To group a Spark DataFrame by some columns with PySpark, I use a command like this:
df2 = df.groupBy('_c1','_c3').agg({'_c4':'max', '_c2' : 'avg'})
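(For context, df comes from reading a headerless CSV, roughly as sketched below, which is why the columns have Spark's default _c0, _c1, ... names; the file path is just a placeholder.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# headerless CSV, so Spark assigns default column names _c0, _c1, _c2, ...
# 'data.csv' is a placeholder path
df = spark.read.csv('data.csv', header=False, inferSchema=True)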
As a result I get output like this:
+-----------------+-------------+------------------+--------+
| _c1| _c3| avg(_c2)|max(_c4)|
+-----------------+-------------+------------------+--------+
| Local-gov| HS-grad| 644952.5714285715| 9|
| Local-gov| Assoc-acdm|365081.64285714284| 12|
| Never-worked| Some-college| 462294.0| 10|
| Local-gov| Bachelors| 398296.35| 13|
| Federal-gov| HS-grad| 493293.0| 9|
| Private| 12th| 632520.5454545454| 8|
| State-gov| Assoc-voc| 412814.0| 11|
| ?| HS-grad| 545870.9230769231| 9|
| Private| Prof-school|340322.89130434784| 15|
+-----------------+-------------+------------------+--------+
This is nice, but there are two things I am missing:
- I would like to have control over the names of the output columns. For example, I want the new column to be named avg_c2 instead of avg(_c2).
- I want to aggregate the same column in different ways. For example, I might want to know both the minimum and the maximum of column _c4. I tried the following and it does not work:
df2 = df.groupBy('_c1','_c3').agg({'_c4': ('min', 'max'), '_c2': 'avg'})
Is there a way to achieve what I need?
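In case it helps to show what I am aiming for, here is a rough sketch of what I imagine the solution might look like, using aggregate functions from pyspark.sql.functions together with .alias() for renaming. I have not verified that this is the recommended approach:

import pyspark.sql.functions as F

# pass explicit aggregate expressions instead of a dict,
# renaming each result with .alias()
df2 = df.groupBy('_c1', '_c3').agg(
    F.avg('_c2').alias('avg_c2'),
    F.min('_c4').alias('min_c4'),
    F.max('_c4').alias('max_c4'),
)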