I have a PySpark DataFrame df:
Date Col1 COl2
2010-01-01 23 28
2012-09-01 50 70
2010-03-04 80 10
2012-04-01 19 20
2012-03-05 67 9
from pyspark.sql.functions import year
df_new = df.withColumn('year', year(df['Date']))
Date Col1 COl2 year
2010-01-01 23 28 2010
2012-09-01 50 70 2012
(and so on)
Now, I am trying to find the maximum of Col1 and Col2 for each year. So I use groupby:
df_new.groupby('year').max().show()
The result I get is not what I expected. Result obtained:
year max(year)
2010 2010
2012 2012
(and so on)
Expected result:
year max(Col1) max(Col2)
2010 80 28
2012 67 70