
My data is as below. I want to sort by the timestamp and use the latest sample of each userid as the testing data. How should I do the train/test split? What I have tried is using pandas to sort_values by timestamp and then groupby 'userid', but I only get a GroupBy object. What is the correct way to do this? Would PySpark be a better tool?

[screenshot of the data]

After I get the dataframe of the testing data, how should I split off the training data? Obviously I cannot use sklearn's train_test_split.

– user10262232
    You have to group by userid and get the max timestamp out of each group. You should be able to get the answer here: https://stackoverflow.com/questions/15705630/get-the-rows-which-have-the-max-value-in-groups-using-groupby – Anatoliy R Nov 13 '19 at 03:58
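
(For reference, the approach from the linked answer translates to a sketch like the one below; the column names userid and timestamp are taken from the question, and a unique index is assumed:)

import pandas as pd

# Index label of the row with the max timestamp for each userid
latest_idx = df.groupby('userid')['timestamp'].idxmax()

test_df = df.loc[latest_idx]
train_df = df.drop(latest_idx)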

2 Answers


You could do the following:

# Sort the data by timestamp
df = df.sort_values('timestamp')

# Group by userid and take the last (latest) entry from each group;
# as_index=False keeps the original row index, so it can be dropped below
test_df = df.groupby(by='userid', as_index=False).nth(-1)

# The remaining rows form the training set
train_df = df.drop(test_df.index)
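
A quick check with toy data (the frame below is illustrative, not from the question) shows the split behaves as expected:

import pandas as pd

df = pd.DataFrame({
    'userid':    [1, 1, 2, 2, 2],
    'timestamp': [10, 20, 5, 15, 25],
})
df = df.sort_values('timestamp')
test_df = df.groupby(by='userid', as_index=False).nth(-1)
train_df = df.drop(test_df.index)
# test_df holds the latest row per user (timestamps 20 and 25);
# train_df holds the remaining three rows.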
– Juan Estevez
  • What? Really? `nth`? – pissall Nov 13 '19 at 04:34
  • Can I ask a further question? After getting the testing data, how should I split it from the original data frame? The df here is the testing data I want, but how should I remove those rows from the original data frame so I can get the training data? Thanks – user10262232 Nov 13 '19 at 09:06

You can do the following:

import pyspark.sql.functions as F

# Max timestamp per user; the aggregated column is named "max(timestamp)"
max_df = df.groupby("userid").agg(F.max("timestamp"))
# Join it back to the original DF so every row carries its user's max
df = df.join(max_df, on="userid")
# Rows at the max timestamp form the test set; everything else is train
train_df = df.filter(df["timestamp"] != df["max(timestamp)"])
test_df = df.filter(df["timestamp"] == df["max(timestamp)"])
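
A minimal end-to-end sketch of the same idea (the SparkSession setup and toy rows are illustrative assumptions, and the max column is aliased here to avoid the awkward "max(timestamp)" name):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 10), (1, 20), (2, 5), (2, 25)],
    ["userid", "timestamp"],
)
max_df = df.groupby("userid").agg(F.max("timestamp").alias("max_timestamp"))
df = df.join(max_df, on="userid")
train_df = df.filter(df["timestamp"] != df["max_timestamp"])
test_df = df.filter(df["timestamp"] == df["max_timestamp"])
# Note: if a user has several rows tied at the max timestamp,
# all of them land in test_df.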
– pissall