4

I need to select all columns from a dataframe while grouping on 'id', but when I do that I only get 'id' and the aggregated value. I need all the columns.

a = df.groupby(df['id']).agg({"date": "max"})
a.show()

This only returns the 'id' column and the aggregated 'date' column, but there are other columns in the dataframe. How do I select all columns for the rows with the max value in 'date'?
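For illustration (the real dataframe isn't shown, so the extra column and the values below are made up, and an existing SparkSession `spark` is assumed), the behaviour looks like this:

# hypothetical example data -- the real dataframe has more columns than 'id' and 'date'
df = spark.createDataFrame(
    [(1, "2018-01-01", "a"),
     (1, "2018-02-01", "b"),
     (2, "2018-03-01", "c")],
    ["id", "date", "other"],
)

a = df.groupby(df['id']).agg({"date": "max"})
a.show()  # only 'id' and 'max(date)' appear; the 'other' column is gone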

Shaido
Tinniam V. Ganesh
    Possible duplicate of [GroupBy column and filter rows with maximum value in Pyspark](https://stackoverflow.com/questions/48829993/groupby-column-and-filter-rows-with-maximum-value-in-pyspark) – pault Sep 06 '18 at 15:11

1 Answer

2

In Spark there are two ways: either join the aggregated result back to the original dataframe, like this:

# aggregate the max date per id, then join it back to the original dataframe
a = df.groupby(df['id']).agg({"date": "max"})
df = df.join(
    a,
    on="id",
    how="inner"
)
df.show()
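The join above keeps every original row and just adds the `max(date)` column alongside the existing columns. If only the rows whose date equals the per-id maximum are wanted, a small follow-up filter works (a sketch, assuming the aggregated column keeps Spark's default name `max(date)`):

import pyspark.sql.functions as F

# rename the aggregated column, keep only the rows on the per-id max date, then drop the helper
df = df.withColumnRenamed("max(date)", "max_date")
df = df.filter(F.col("date") == F.col("max_date")).drop("max_date")
df.show()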

or use a window partitioned by "id", like this:

from pyspark.sql import Window
import pyspark.sql.functions as F

# compute the max date per id as a new column while keeping all original columns
window = Window.partitionBy("id")
a = df.withColumn(
    "max",
    F.max(F.col("date")).over(window)
)
a.show()
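Similarly, the window version keeps all the original columns and just adds a "max" column; to keep only the rows on the per-id max date, one could filter on it afterwards (continuing the snippet above):

# keep only the rows whose date equals the per-id maximum, then drop the helper column
a = a.filter(F.col("date") == F.col("max")).drop("max")
a.show()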

I would prefer the first one, as it is usually less costly even with the extra join.

Ankit Kumar Namdeo
  • for second method `TypeError: withColumn() missing 1 required positional argument: 'col' ` – alex3465 May 25 '21 at 14:45
  • Thanks for pointing it out, edited the solution. withColumn takes two positional arguments: the first is the name of the new column and the second is the expression that defines it – Ankit Kumar Namdeo May 26 '21 at 04:38