I have a dataframe as follows
col1, col2, version_time, col3
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = true)
|-- version_time: timestamp (nullable = true)
|-- col3: string (nullable = true)
Below are some sample rows
col1 col2 timestamp col3
1 A 2021-05-09T13:53:20.219Z B
2 A 2021-01-09T13:53:20.219Z C
3 A 2021-02-09T13:53:20.219Z D
1 A 2020-05-09T13:53:20.219Z E
1 A 2019-05-09T13:53:20.219Z F
What I want is to groupBy col1 and col2 with aggregate on max(timestamp) and also return all the columns.
col1 col2 timestamp col3
1 A 2021-05-09T13:53:20.219Z B
2 A 2021-01-09T13:53:20.219Z C
3 A 2021-02-09T13:53:20.219Z D
If i use groupBy on dataframe it will drop col3
. I will have to join with original dataframe to get the value of col3
col1 col2 timestamp
1 A 2021-05-09T13:53:20.219Z
2 A 2021-01-09T13:53:20.219Z
3 A 2021-02-09T13:53:20.219Z
If I use Window.partitionBy, i will still have 5 rows with same timestamp value for col1 and col2, which is not what i want.
col1 col2 timestamp col3
1 A 2021-05-09T13:53:20.219Z B
2 A 2021-01-09T13:53:20.219Z C
3 A 2021-02-09T13:53:20.219Z D
1 A 2021-05-09T13:53:20.219Z E
1 A 2021-05-09T13:53:20.219Z F
Is there any other alternative?