Pyspark / Python - Use MIN / MAX without losing columns

Question

I have a dataframe like this:

----------------------------------------------
| User_ID |      Timestamp      | Article_ID |
----------------------------------------------
| 121212  | 2018-01-15 10:00:00 |      1     |
| 121212  | 2018-01-15 10:05:00 |      11    |
| 121212  | 2018-01-15 10:10:00 |      12    |
| 989898  | 2018-01-15 17:30:00 |      100   |
| 989898  | 2018-01-15 17:40:00 |      200   |
| 989898  | 2018-01-15 17:50:00 |      1     |
| 989898  | 2018-01-15 17:55:00 |      11    |
|...      |                     |            |
----------------------------------------------

Now I want the row with the minimum Timestamp per User_ID. The result should be:

----------------------------------------------
| User_ID |      Timestamp      | Article_ID |
----------------------------------------------
| 121212  | 2018-01-15 10:00:00 |      1     |
| 989898  | 2018-01-15 17:30:00 |      100   |
|...      |                     |            |
----------------------------------------------

I tried the following:

df.groupBy('User_ID').agg(F.min('Timestamp')).show()

That returns the minimum Timestamp per User_ID, but the 'Article_ID' column is missing from the result. Can someone please help me?

qwertz · Accepted Answer · 2018-01-15T22:36:26.587

I found a working solution using the struct() function:

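from pyspark.sql import functions as F

# Pack Timestamp and Article_ID into one struct so min() keeps them together.
(df
 .select('User_ID', F.struct('Timestamp', 'Article_ID').alias('TA'))
 .groupBy('User_ID')
 .agg(F.min('TA').alias('TA'))
 # Unpack the struct back into separate columns.
 .select('User_ID', 'TA.Timestamp', 'TA.Article_ID')
 .orderBy('User_ID')
 .limit(10)
 .toPandas())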

See also original source: How to select the first row of each group?
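
Why this works: Spark compares structs field by field, left to right, so the minimum of struct(Timestamp, Article_ID) is the entry with the earliest Timestamp, with ties broken by the smaller Article_ID. Python tuples compare the same way, so the idea can be sketched without a Spark session. This is only an illustration, not the Spark code itself: the rows are the sample data from the question, the timestamps are assumed to be strings in the format shown (which sort chronologically), and the helper name `earliest_per_user` is made up.

```python
from collections import defaultdict

# Sample rows from the question: (User_ID, Timestamp, Article_ID).
rows = [
    (121212, "2018-01-15 10:00:00", 1),
    (121212, "2018-01-15 10:05:00", 11),
    (121212, "2018-01-15 10:10:00", 12),
    (989898, "2018-01-15 17:30:00", 100),
    (989898, "2018-01-15 17:40:00", 200),
    (989898, "2018-01-15 17:50:00", 1),
    (989898, "2018-01-15 17:55:00", 11),
]


def earliest_per_user(rows):
    """Hypothetical helper: per User_ID, keep the (Timestamp, Article_ID)
    pair with the smallest Timestamp."""
    grouped = defaultdict(list)
    for user_id, timestamp, article_id in rows:
        # Timestamp comes first so it dominates the comparison,
        # like the first field of the struct in the Spark version.
        grouped[user_id].append((timestamp, article_id))
    # min() on tuples compares element by element, left to right.
    return {user_id: min(pairs) for user_id, pairs in grouped.items()}


result = earliest_per_user(rows)
for user_id in sorted(result):
    print(user_id, *result[user_id])
```

The field order matters: if Article_ID came first in the struct, the minimum would be taken over Article_ID instead. Replacing F.min with F.max on the same struct gives the row with the latest Timestamp.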

edited Jan 15 '18 at 22:36

answered Jan 15 '18 at 22:05

qwertz

If you use a solution from another answer, don't forget the [required attribution](https://stackoverflow.blog/2009/06/25/attribution-required/). – Alper t. Turker Jan 15 '18 at 22:08

Thanks for the note. I added the original source... – qwertz Jan 15 '18 at 22:37
