
df

  Date          Col1   Col2
  2010-01-01     23     28
  2012-09-01     50     70
  2010-03-04     80     10
  2012-04-01     19     20
  2012-03-05     67      9

  df_new = df.withColumn('year', year(df['Date']))

  Date          Col1   Col2   year
  2010-01-01     23     28    2010
  2012-09-01     50     70    2012   and so on
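
For reference, a minimal sketch of how a comparable frame could be built to reproduce this (the rows are the sample above; the SparkSession setup and the assumption that Date arrives as a string are mine):

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import to_date, year

  spark = SparkSession.builder.getOrCreate()

  # Sample rows taken from the table above; Date starts out as a string
  df = spark.createDataFrame(
      [("2010-01-01", 23, 28),
       ("2012-09-01", 50, 70),
       ("2010-03-04", 80, 10),
       ("2012-04-01", 19, 20),
       ("2012-03-05", 67, 9)],
      ["Date", "Col1", "Col2"],
  )

  # Casting to a date first makes the intent of year() explicit
  df_new = df.withColumn("year", year(to_date(df["Date"])))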

Now, I am trying to find the maximum of Col1 and Col2 for each year. So I use groupby:

   df_new.groupby('year').max().show()

The result I get is not what I expected. Result obtained:

   year    max(year)
   2010    2010
   2012    2012   and so on

Expected result

    year    max(Col1)    max(Col2)
    2010       80           28
    2012       67           70
noob
  • Check what the datatypes of Col1 and Col2 are. – anky Jun 14 '20 at 13:33
  • You can aggregate values of multiple columns using `.agg()`. Google can provide further assistance https://stackoverflow.com/questions/36251004/pyspark-aggregation-on-multiple-columns – mazaneicha Jun 14 '20 at 13:40
  • Does this answer your question? [Pyspark - Aggregation on multiple columns](https://stackoverflow.com/questions/36251004/pyspark-aggregation-on-multiple-columns) – notNull Jun 14 '20 at 13:45
  • @mazaneicha just for my knowledge, I tried the OP's code and it worked (returned aggregated values of all columns). Does this mean the dtypes here are not numeric? Is that the reason it didn't work for the OP, or do you foresee other issues? Aggregating with max would still work on a string but might be inconsistent. Thanks :) – anky Jun 14 '20 at 13:51
  • @anky right, _max(*cols): Computes the max value for each **numeric** column for each group._ And if the OP tried `df_new.groupby('year').max('Col1').show()` they would get "Col1 is not a numeric column" or something like that (see the dtype-check sketch after these comments). – mazaneicha Jun 14 '20 at 15:09
  • @mazaneicha Thanks, so then there might soon be another question saying max is not working correctly :) – anky Jun 14 '20 at 15:10
  • Looks like that's where this is headed :)) – mazaneicha Jun 14 '20 at 15:12
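
Following up on the comments above, a quick sketch for confirming the dtype diagnosis and casting before aggregating (column names as in the question; the cast is only needed if Col1/Col2 are not already numeric):

  from pyspark.sql import functions as F

  # GroupedData.max() only covers numeric columns, so check the schema first
  print(df_new.dtypes)

  # If Col1/Col2 came in as strings, cast them before aggregating
  df_cast = (df_new
             .withColumn("Col1", F.col("Col1").cast("int"))
             .withColumn("Col2", F.col("Col2").cast("int")))

  df_cast.groupBy("year").max("Col1", "Col2").show()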

3 Answers


Check the code below:

from pyspark.sql import functions as F

(df.withColumn('year', F.year(df['date']))
   .groupBy("year")
   .agg(F.max("col1").alias("max_col1"),
        F.max("col2").alias("max_col2"))
   .show())
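
If there are more value columns to aggregate, the same agg() call can be built programmatically instead of listing each max() by hand (a sketch; the column list here is just the two from the question):

from pyspark.sql import functions as F

# Build one max() expression per value column, then unpack into agg()
value_cols = ["col1", "col2"]
aggs = [F.max(c).alias("max_{}".format(c)) for c in value_cols]

(df.withColumn("year", F.year(df["date"]))
   .groupBy("year")
   .agg(*aggs)
   .show())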

Srinivas

You should aggregate with multiple max expressions on Col1 and Col2 inside a single agg():

from pyspark.sql import functions as F

(df_new.groupBy(F.year("Date"))
       .agg(F.max("Col1"), F.max("Col2"))
       .show())
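
Note that the output columns of the snippet above come out auto-named (`year(Date)`, `max(Col1)`, `max(Col2)`); if you want cleaner names, aliasing each expression is one option (a sketch, not part of the original answer):

from pyspark.sql import functions as F

(df_new.groupBy(F.year("Date").alias("year"))
       .agg(F.max("Col1").alias("max_Col1"),
            F.max("Col2").alias("max_Col2"))
       .show())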
QuickSilver

In case you have a huge dataset, it can be better to use a window function, as below. It computes the per-year aggregates while keeping every row, which can perform better than a groupBy followed by a join back to the detail rows.

from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

df = spark.table("test_poc")
df = df.withColumn("Year", F.year(F.col('date')))

# Partition by year so max/min are computed per year while every row is kept
_w = W.partitionBy(F.col('year'))
df = df.withColumn('max_col', F.max('id').over(_w)).withColumn('min_col', F.min('id').over(_w))
df.show()

---------OUTPUT------------

+---+-------------------+----+-------+-------+
| id|               date|Year|max_col|min_col|
+---+-------------------+----+-------+-------+
|  5|2019-12-31 23:26:59|2019|      5|      2|
|  2|2019-12-31 23:26:59|2019|      5|      2|
|  1|1969-12-31 23:26:59|1969|      3|      1|
|  2|1969-12-31 23:26:30|1969|      3|      1|
|  3|1969-12-31 23:26:26|1969|      3|      1|
|  4|2020-12-31 23:26:59|2020|      4|      1|
|  1|2020-12-31 23:26:59|2020|      4|      1|
+---+-------------------+----+-------+-------+
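
The window keeps every input row; if you ultimately want just one row per year, a distinct select over the windowed columns is one way to collapse the frame above (a follow-up sketch, not part of the original answer):

# One row per year, derived from the windowed columns above
(df.select("Year", "max_col", "min_col")
   .distinct()
   .orderBy("Year")
   .show())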
dsk