4

I need to select all columns from a dataframe while grouping on 'id', but when I do that I only get 'id' and the aggregated value. I need all the columns.

a = df.groupby(df['id']).agg({"date": "max"})
a.show()

This only returns the 'id' column and the aggregated 'date' column, but there are other columns in the dataframe. How do I select all columns for the rows with the max value in 'date'?
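For illustration (the real dataframe isn't shown, so the extra column and the values below are made up, and an existing SparkSession `spark` is assumed), the behaviour looks like this:

# hypothetical example data -- the real dataframe has more columns than 'id' and 'date'
df = spark.createDataFrame(
    [(1, "2018-01-01", "a"),
     (1, "2018-02-01", "b"),
     (2, "2018-03-01", "c")],
    ["id", "date", "other"],
)

a = df.groupby(df['id']).agg({"date": "max"})
a.show()  # only 'id' and 'max(date)' appear; the 'other' column is gone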

Shaido
Tinniam V. Ganesh
    Possible duplicate of [GroupBy column and filter rows with maximum value in Pyspark](https://stackoverflow.com/questions/48829993/groupby-column-and-filter-rows-with-maximum-value-in-pyspark) – pault Sep 06 '18 at 15:11

1 Answer

2

In Spark there are two ways: either join the aggregated result back to the original dataframe, like this:

# aggregate the max date per id, then join it back to the original dataframe
a = df.groupby(df['id']).agg({"date": "max"})
df = df.join(
    a,
    on="id",
    how="inner"
)
df.show()
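The join above keeps every original row and just adds the `max(date)` column alongside the existing columns. If only the rows whose date equals the per-id maximum are wanted, a small follow-up filter works (a sketch, assuming the aggregated column keeps Spark's default name `max(date)`):

import pyspark.sql.functions as F

# rename the aggregated column, keep only the rows on the per-id max date, then drop the helper
df = df.withColumnRenamed("max(date)", "max_date")
df = df.filter(F.col("date") == F.col("max_date")).drop("max_date")
df.show()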

or use a window partitioned by "id", like this:

from pyspark.sql import Window
import pyspark.sql.functions as F

# compute the max date per id as a new column while keeping all original columns
window = Window.partitionBy("id")
a = df.withColumn(
    "max",
    F.max(F.col("date")).over(window)
)
a.show()
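Similarly, the window version keeps all the original columns and just adds a "max" column; to keep only the rows on the per-id max date, one could filter on it afterwards (continuing the snippet above):

# keep only the rows whose date equals the per-id maximum, then drop the helper column
a = a.filter(F.col("date") == F.col("max")).drop("max")
a.show()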

I would prefer the first one, as it is usually less costly even with the extra join.

Ankit Kumar Namdeo
  • for second method `TypeError: withColumn() missing 1 required positional argument: 'col' ` – alex3465 May 25 '21 at 14:45
  • Thanks for pointing it out, edited the solution. withColumn takes two positional arguments: the first is the name of the new column and the second is the expression that defines it – Ankit Kumar Namdeo May 26 '21 at 04:38