I have a dataframe like this:
----------------------------------------------
| User_ID | Timestamp | Article_ID |
----------------------------------------------
| 121212 | 2018-01-15 10:00:00 | 1 |
| 121212 | 2018-01-15 10:05:00 | 11 |
| 121212 | 2018-01-15 10:10:00 | 12 |
| 989898 | 2018-01-15 17:30:00 | 100 |
| 989898 | 2018-01-15 17:40:00 | 200 |
| 989898 | 2018-01-15 17:50:00 | 1 |
| 989898 | 2018-01-15 17:55:00 | 11 |
|... | | |
----------------------------------------------
Now I want the row with the minimum Timestamp per User_ID. The result should be:
----------------------------------------------
| User_ID | Timestamp | Article_ID |
----------------------------------------------
| 121212 | 2018-01-15 10:00:00 | 1 |
| 989898 | 2018-01-15 17:30:00 | 100 |
|... | | |
----------------------------------------------
I tried the following:
df.groupBy('User_ID').agg(F.min('Timestamp')).show()
This gives me the minimum Timestamp per User_ID, but the column 'Article_ID' is missing. Can someone please help me?