
My data is as below. I want to sort by the timestamp and use the latest sample of each userid as the testing data. How should I do the train/test split? What I have tried is using pandas to sort_values by timestamp and then groupby 'userid', but I only get a GroupBy object. What is the correct way to do this? Would PySpark be a better tool?

[screenshot of the data]

After I get the dataframe of the testing data, how should I split off the training data? Obviously I cannot use sklearn's train_test_split.

– user10262232
    You have to group by userid and get the max timestamp out of each group. You should be able to get the answer here: https://stackoverflow.com/questions/15705630/get-the-rows-which-have-the-max-value-in-groups-using-groupby – Anatoliy R Nov 13 '19 at 03:58
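
(For reference, the approach from the linked answer translates to a sketch like the one below; the column names userid and timestamp are taken from the question, and a unique index is assumed:)

import pandas as pd

# Index label of the row with the max timestamp for each userid
latest_idx = df.groupby('userid')['timestamp'].idxmax()

test_df = df.loc[latest_idx]
train_df = df.drop(latest_idx)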

2 Answers


You could do the following:

# Sort the data by timestamp
df = df.sort_values('timestamp')

# Group by userid and take the last (latest) entry from each group;
# as_index=False keeps the original row index, so it can be dropped below
test_df = df.groupby(by='userid', as_index=False).nth(-1)

# The remaining rows form the training set
train_df = df.drop(test_df.index)
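
A quick check with toy data (the frame below is illustrative, not from the question) shows the split behaves as expected:

import pandas as pd

df = pd.DataFrame({
    'userid':    [1, 1, 2, 2, 2],
    'timestamp': [10, 20, 5, 15, 25],
})
df = df.sort_values('timestamp')
test_df = df.groupby(by='userid', as_index=False).nth(-1)
train_df = df.drop(test_df.index)
# test_df holds the latest row per user (timestamps 20 and 25);
# train_df holds the remaining three rows.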
– Juan Estevez
  • What? Really? `nth`? – pissall Nov 13 '19 at 04:34
  • Can I ask a further question? After getting the testing data, how should I split it from the original data frame? The df here is the testing data I want, but how should I remove those rows from the original data frame so I can get the training data? Thanks – user10262232 Nov 13 '19 at 09:06

You can do the following:

import pyspark.sql.functions as F

# Max timestamp per user; the aggregated column is named "max(timestamp)"
max_df = df.groupby("userid").agg(F.max("timestamp"))
# Join it back to the original DF so every row carries its user's max
df = df.join(max_df, on="userid")
# Rows at the max timestamp form the test set; everything else is train
train_df = df.filter(df["timestamp"] != df["max(timestamp)"])
test_df = df.filter(df["timestamp"] == df["max(timestamp)"])
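
A minimal end-to-end sketch of the same idea (the SparkSession setup and toy rows are illustrative assumptions, and the max column is aliased here to avoid the awkward "max(timestamp)" name):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 10), (1, 20), (2, 5), (2, 25)],
    ["userid", "timestamp"],
)
max_df = df.groupby("userid").agg(F.max("timestamp").alias("max_timestamp"))
df = df.join(max_df, on="userid")
train_df = df.filter(df["timestamp"] != df["max_timestamp"])
test_df = df.filter(df["timestamp"] == df["max_timestamp"])
# Note: if a user has several rows tied at the max timestamp,
# all of them land in test_df.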
– pissall