
I have a dataframe as follows

col1, col2, version_time, col3

root
 |-- col1: string (nullable = true)
 |-- col2: integer (nullable = true)
 |-- version_time: timestamp (nullable = true) 
 |-- col3: string (nullable = true)

Below are some sample rows

col1  col2  timestamp                 col3
 1     A    2021-05-09T13:53:20.219Z   B
 2     A    2021-01-09T13:53:20.219Z   C
 3     A    2021-02-09T13:53:20.219Z   D
 1     A    2020-05-09T13:53:20.219Z   E
 1     A    2019-05-09T13:53:20.219Z   F

What I want is to groupBy col1 and col2 with an aggregate on max(timestamp), and also return all the columns:

col1  col2  timestamp                 col3
 1     A    2021-05-09T13:53:20.219Z   B
 2     A    2021-01-09T13:53:20.219Z   C
 3     A    2021-02-09T13:53:20.219Z   D

If I use groupBy on the dataframe, it will drop col3. I would have to join with the original dataframe to get the value of col3 (a sketch of this attempt follows the table below):

    col1  col2  timestamp                 
     1     A    2021-05-09T13:53:20.219Z
     2     A    2021-01-09T13:53:20.219Z
     3     A    2021-02-09T13:53:20.219Z
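For reference, a minimal Java sketch of that groupBy attempt; the DataFrame name df is an assumption for illustration:

    import static org.apache.spark.sql.functions.max;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // df is the DataFrame described above (name assumed).
    // groupBy keeps only the grouping columns plus the aggregates, so col3 is dropped.
    Dataset<Row> grouped = df.groupBy("col1", "col2")
            .agg(max("timestamp").alias("timestamp"));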

If I use Window.partitionBy, I will still have 5 rows, with the same max timestamp value repeated for each col1 and col2 group, which is not what I want (a sketch follows the table below):

col1  col2  timestamp                 col3
 1     A    2021-05-09T13:53:20.219Z   B
 2     A    2021-01-09T13:53:20.219Z   C
 3     A    2021-02-09T13:53:20.219Z   D
 1     A    2021-05-09T13:53:20.219Z   E
 1     A    2021-05-09T13:53:20.219Z   F
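A sketch of that Window.partitionBy attempt (same assumed df); it attaches the group max to every row but filters nothing, which is why all 5 rows remain:

    import static org.apache.spark.sql.functions.max;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;

    // Replace timestamp with the per-group max; every original row survives.
    WindowSpec byKeys = Window.partitionBy("col1", "col2");
    Dataset<Row> withMax = df.withColumn("timestamp", max("timestamp").over(byKeys));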

Is there any other alternative?

BigDataLearner
  • You can use the rank window function partitioned on col1 and col2, sorted by the timestamp descending, then select the records where rank=1. The Spark SQL equivalent will be something like this: select * from (select *, rank() over (partition by col1, col2 order by timestamp desc) as rnk from my_table) temp where rnk=1 – linusRian Jun 02 '21 at 04:21
  • @linusRian, thanks for the comment. Do you know how to select the rank=1 in Java? – BigDataLearner Jun 02 '21 at 05:09
  • You can read the dataframe and store it in a temp table using createOrReplaceTempView, then execute the above query using spark.sql("yourquerygoeshere"); a Java sketch follows these comments. Please refer to https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#running-sql-queries-programmatically. You can also do it using the dataframe APIs. – linusRian Jun 02 '21 at 05:28
  • Solution works beautifully, thanks. – BigDataLearner Jun 02 '21 at 05:39
  • Great, you are welcome :) – linusRian Jun 02 '21 at 05:49
  • I have added it as an answer as the suggestion worked for you. – linusRian Jun 02 '21 at 06:11
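A minimal Java sketch of what the comments describe; the view name my_table and the variable names df and spark are assumptions, not from the original:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Register the DataFrame as a temp view, then run the rank query against it.
    df.createOrReplaceTempView("my_table");  // "my_table" is a placeholder name
    Dataset<Row> latest = spark.sql(
        "select * from ("
      + " select *, rank() over (partition by col1, col2 order by timestamp desc) as rnk"
      + " from my_table"
      + ") temp where rnk = 1"
    ).drop("rnk");  // drop the helper ranking column from the result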

1 Answer


You can use the rank window function partitioned on col1 and col2, sorted by the timestamp descending, then select the records where rank=1. The Spark SQL equivalent will be something like this:

    select *
    from (
      select *,
             rank() over (partition by col1, col2 order by timestamp desc) as rnk
      from my_table  -- my_table is a placeholder for the registered temp view
    ) temp
    where rnk = 1
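The same logic with the DataFrame API, as mentioned in the comments, could look like this Java sketch (df is an assumed name for the DataFrame):

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.rank;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;

    // Rank rows within each (col1, col2) group by timestamp, newest first,
    // keep only the top-ranked row per group, then drop the helper column.
    WindowSpec byKeys = Window.partitionBy("col1", "col2")
            .orderBy(col("timestamp").desc());
    Dataset<Row> latest = df.withColumn("rnk", rank().over(byKeys))
            .filter(col("rnk").equalTo(1))
            .drop("rnk");

Note that rank() returns ties when two rows in a group share the same maximum timestamp; if exactly one row per group is required, row_number() behaves the same way here but breaks ties arbitrarily.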
linusRian