
I have a dataframe as follows

col1, col2, version_time, col3

root
 |-- col1: string (nullable = true)
 |-- col2: integer (nullable = true)
 |-- version_time: timestamp (nullable = true) 
 |-- col3: string (nullable = true)

Below are some sample rows

col1  col2  timestamp                 col3
 1     A    2021-05-09T13:53:20.219Z   B
 2     A    2021-01-09T13:53:20.219Z   C
 3     A    2021-02-09T13:53:20.219Z   D
 1     A    2020-05-09T13:53:20.219Z   E
 1     A    2019-05-09T13:53:20.219Z   F

What I want is to groupBy col1 and col2 with an aggregate on max(timestamp), and also return all the columns:

col1  col2  timestamp                 col3
 1     A    2021-05-09T13:53:20.219Z   B
 2     A    2021-01-09T13:53:20.219Z   C
 3     A    2021-02-09T13:53:20.219Z   D

If I use groupBy on the dataframe, it will drop col3. I would have to join with the original dataframe to get the value of col3 (a sketch of this attempt follows the table below):

    col1  col2  timestamp                 
     1     A    2021-05-09T13:53:20.219Z
     2     A    2021-01-09T13:53:20.219Z
     3     A    2021-02-09T13:53:20.219Z
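For reference, a minimal Java sketch of that groupBy attempt; the DataFrame name df is an assumption for illustration:

    import static org.apache.spark.sql.functions.max;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // df is the DataFrame described above (name assumed).
    // groupBy keeps only the grouping columns plus the aggregates, so col3 is dropped.
    Dataset<Row> grouped = df.groupBy("col1", "col2")
            .agg(max("timestamp").alias("timestamp"));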

If I use Window.partitionBy, I will still have 5 rows, with the same max timestamp value repeated for each col1 and col2 group, which is not what I want (a sketch follows the table below):

col1  col2  timestamp                 col3
 1     A    2021-05-09T13:53:20.219Z   B
 2     A    2021-01-09T13:53:20.219Z   C
 3     A    2021-02-09T13:53:20.219Z   D
 1     A    2021-05-09T13:53:20.219Z   E
 1     A    2021-05-09T13:53:20.219Z   F
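A sketch of that Window.partitionBy attempt (same assumed df); it attaches the group max to every row but filters nothing, which is why all 5 rows remain:

    import static org.apache.spark.sql.functions.max;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;

    // Replace timestamp with the per-group max; every original row survives.
    WindowSpec byKeys = Window.partitionBy("col1", "col2");
    Dataset<Row> withMax = df.withColumn("timestamp", max("timestamp").over(byKeys));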

Is there any other alternative?

BigDataLearner
  • You can use the rank window function partitioned on col1 and col2, sorted by the timestamp descending, then select the records where rank=1. The Spark SQL equivalent will be something like this: select * from (select *, rank() over (partition by col1, col2 order by timestamp desc) as rnk from my_table) temp where rnk=1 – linusRian Jun 02 '21 at 04:21
  • @linusRian, thanks for the comment. Do you know how to select the rank=1 in Java? – BigDataLearner Jun 02 '21 at 05:09
  • You can read the dataframe and store it in a temp table using createOrReplaceTempView, then execute the above query using spark.sql("yourquerygoeshere"); a Java sketch follows these comments. Please refer to https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#running-sql-queries-programmatically. You can also do it using the dataframe APIs. – linusRian Jun 02 '21 at 05:28
  • Solution works beautifully, thanks. – BigDataLearner Jun 02 '21 at 05:39
  • Great, you are welcome :) – linusRian Jun 02 '21 at 05:49
  • I have added it as an answer as the suggestion worked for you. – linusRian Jun 02 '21 at 06:11
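A minimal Java sketch of what the comments describe; the view name my_table and the variable names df and spark are assumptions, not from the original:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Register the DataFrame as a temp view, then run the rank query against it.
    df.createOrReplaceTempView("my_table");  // "my_table" is a placeholder name
    Dataset<Row> latest = spark.sql(
        "select * from ("
      + " select *, rank() over (partition by col1, col2 order by timestamp desc) as rnk"
      + " from my_table"
      + ") temp where rnk = 1"
    ).drop("rnk");  // drop the helper ranking column from the result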

1 Answer


You can use the rank window function partitioned on col1 and col2, sorted by the timestamp descending, then select the records where rank=1. The Spark SQL equivalent will be something like this:

    select *
    from (
      select *,
             rank() over (partition by col1, col2 order by timestamp desc) as rnk
      from my_table  -- my_table is a placeholder for the registered temp view
    ) temp
    where rnk = 1
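The same logic with the DataFrame API, as mentioned in the comments, could look like this Java sketch (df is an assumed name for the DataFrame):

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.rank;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;

    // Rank rows within each (col1, col2) group by timestamp, newest first,
    // keep only the top-ranked row per group, then drop the helper column.
    WindowSpec byKeys = Window.partitionBy("col1", "col2")
            .orderBy(col("timestamp").desc());
    Dataset<Row> latest = df.withColumn("rnk", rank().over(byKeys))
            .filter(col("rnk").equalTo(1))
            .drop("rnk");

Note that rank() returns ties when two rows in a group share the same maximum timestamp; if exactly one row per group is required, row_number() behaves the same way here but breaks ties arbitrarily.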
linusRian