
I have a dataframe like this:

----------------------------------------------
| User_ID |      Timestamp      | Article_ID |
----------------------------------------------
| 121212  | 2018-01-15 10:00:00 |      1     |
| 121212  | 2018-01-15 10:05:00 |      11    |
| 121212  | 2018-01-15 10:10:00 |      12    |
| 989898  | 2018-01-15 17:30:00 |      100   |
| 989898  | 2018-01-15 17:40:00 |      200   |
| 989898  | 2018-01-15 17:50:00 |      1     |
| 989898  | 2018-01-15 17:55:00 |      11    |
|...      |                     |            |
----------------------------------------------

Now I want the row with the minimum Timestamp for each User_ID. The result should be:

----------------------------------------------
| User_ID |      Timestamp      | Article_ID |
----------------------------------------------
| 121212  | 2018-01-15 10:00:00 |      1     |
| 989898  | 2018-01-15 17:30:00 |      100   |
|...      |                     |            |
----------------------------------------------

I tried the following:

from pyspark.sql import functions as F
df.groupBy('User_ID').agg(F.min('Timestamp')).show()

This returns the minimum Timestamp per User_ID, but the 'Article_ID' column is missing from the result. Can someone please help me?

Alper t. Turker
qwertz

1 Answer


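I found a working solution using the struct() function. min() on a struct column compares the fields in order, so putting Timestamp first selects the earliest row per user and carries its Article_ID along: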

(df.select('User_ID', F.struct('Timestamp', 'Article_ID').alias('TA'))
   .groupBy('User_ID')
   .agg(F.min('TA').alias('TA'))
   .select('User_ID', 'TA.Timestamp', 'TA.Article_ID')
   .orderBy('User_ID')
   .limit(10)
   .toPandas())
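For anyone wondering why taking the minimum of a struct keeps Article_ID attached to the right Timestamp: the ordering behaves much like tuple comparison in plain Python. Here is a minimal sketch of the same idea without Spark, using the sample rows from the question. The helper name `first_row_per_user` is made up for illustration, and the timestamps are kept as ISO-formatted strings, which happen to sort chronologically.

```python
# Sample rows from the question: (User_ID, Timestamp, Article_ID).
rows = [
    (121212, "2018-01-15 10:00:00", 1),
    (121212, "2018-01-15 10:05:00", 11),
    (121212, "2018-01-15 10:10:00", 12),
    (989898, "2018-01-15 17:30:00", 100),
    (989898, "2018-01-15 17:40:00", 200),
    (989898, "2018-01-15 17:50:00", 1),
    (989898, "2018-01-15 17:55:00", 11),
]


def first_row_per_user(rows):
    """Hypothetical helper: keep, per user, the (Timestamp, Article_ID) pair that sorts first."""
    best = {}
    for user_id, timestamp, article_id in rows:
        # Timestamp comes first in the tuple, so it decides the ordering;
        # Article_ID is only consulted when two timestamps are equal.
        candidate = (timestamp, article_id)
        if user_id not in best or candidate < best[user_id]:
            best[user_id] = candidate
    # Flatten back to (User_ID, Timestamp, Article_ID), ordered by User_ID.
    return [(user_id, ts, art) for user_id, (ts, art) in sorted(best.items())]


result = first_row_per_user(rows)
for row in result:
    print(row)
```

One consequence of this ordering: if a user has two rows with exactly the same Timestamp, the one with the smaller Article_ID is picked.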

See also original source: How to select the first row of each group?

qwertz