How to split a data frame into training, validation, and test sets dependent on ID's?

Question

I need to split my data set randomly into training, validation, and test sets, as shown in this post (R: How to split a data frame into training, validation, and test sets?), but the split needs to be made on the subject IDs, not on the rows of the whole data frame.

When I apply the code from the answer to that question, it splits my data frame completely at random. My data has several stacked rows per ID, and these need to stay together; otherwise one subject's data will be spread across the different sets.

Sorry if this sounds a bit confusing. Here is my data to illustrate the issue:

df <- data.frame(Contact.ID, Date.Time, Age, Gender, Attendance)

   Contact.ID           Date.Time Age Gender Attendance
1           A 2012-07-06 18:54:48  37   Male         30
2           A 2012-07-06 20:50:18  37   Male         30
3           A 2012-08-14 20:18:44  37   Male         30
4           B 2012-03-15 16:58:15  27 Female         40
5           B 2012-04-18 10:57:02  27 Female         40
6           B 2012-04-18 17:31:22  27 Female         40
7           B 2012-04-18 18:37:00  27 Female         40
8           C 2013-10-22 17:46:07  40   Male          5
9           C 2013-10-27 11:21:00  40   Male          5
10          D 2012-07-28 14:48:33  20 Female         12

If I split this data randomly, subject A's entries could, for instance, end up with two in my test set and one in my validation set. I need a random split of the distinct IDs, not a random split of the whole data frame, and I cannot figure out how to connect the two.

score 4 · Accepted Answer · answered Aug 20 '17 at 00:02

The code you posted from the previous train/validate/test question assigns a train, validate, or test label to each row of a data frame and then splits based on the label of each row:

spec = c(train = .6, test = .2, validate = .2)
g = sample(cut(
  seq(nrow(df)), 
  nrow(df)*cumsum(c(0,spec)),
  labels = names(spec)
))
res = split(df, g)

Instead, you could assign a label to each unique level of your ID factor variable and split based on the label assigned to the ID of each row:

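set.seed(144)  # make the random assignment reproducible
spec = c(train = .6, test = .2, validate = .2)
# One label per unique ID (not per row), shuffled into random order
g = sample(cut(
  seq_along(unique(df$Contact.ID)), 
  length(unique(df$Contact.ID))*cumsum(c(0,spec)),
  labels = names(spec)
))
# Look up each row's label through the integer code of its ID, then split
(res = split(df, g[as.factor(df$Contact.ID)]))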
# $train
#   Contact.ID          Date.Time Age Gender Attendance
# 1          A 2012-07-0618:54:48  37   Male         30
# 2          A 2012-07-0620:50:18  37   Male         30
# 3          A 2012-08-1420:18:44  37   Male         30
# 8          C 2013-10-2217:46:07  40   Male          5
# 9          C 2013-10-2711:21:00  40   Male          5
# 
# $test
#   Contact.ID          Date.Time Age Gender Attendance
# 4          B 2012-03-1516:58:15  27 Female         40
# 5          B 2012-04-1810:57:02  27 Female         40
# 6          B 2012-04-1817:31:22  27 Female         40
# 7          B 2012-04-1818:37:00  27 Female         40
# 
# $validate
#    Contact.ID          Date.Time Age Gender Attendance
# 10          D 2012-07-2814:48:33  20 Female         12

Note that this changes the interpretation of the split proportions: the 60% assigned to the training set is now 60% of the unique subject IDs, not 60% of the rows.
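
To see what that change of interpretation means in practice, here is a minimal sketch that rebuilds a cut-down version of the question's sample data (only `Contact.ID` and `Age`; the helper names `ids`, `sets_per_id`, `id_share`, and `row_share` are illustrative and not from the original code). It checks that no ID ends up in more than one set and compares the share of IDs with the share of rows in each set:

```r
# Cut-down copy of the question's sample data (other columns omitted)
df <- data.frame(
  Contact.ID = rep(c("A", "B", "C", "D"), times = c(3, 4, 2, 1)),
  Age        = rep(c(37, 27, 40, 20),     times = c(3, 4, 2, 1)),
  stringsAsFactors = FALSE
)

set.seed(144)
spec <- c(train = .6, test = .2, validate = .2)
ids  <- unique(df$Contact.ID)

# One label per unique ID, shuffled (same technique as in the answer)
g <- sample(cut(
  seq_along(ids),
  length(ids) * cumsum(c(0, spec)),
  labels = names(spec)
))
res <- split(df, g[as.factor(df$Contact.ID)])

# Number of sets each ID appears in; every entry should be 1
sets_per_id <- table(unlist(lapply(res, function(d) unique(d$Contact.ID))))
stopifnot(all(sets_per_id == 1))

# Share of IDs versus share of rows in each set
id_share  <- sapply(res, function(d) length(unique(d$Contact.ID))) / length(ids)
row_share <- sapply(res, nrow) / nrow(df)
rbind(id_share, row_share)
```

With only four IDs, the ID shares can only be multiples of 0.25, so the 60/20/20 specification is approximated rather than met exactly. The row shares depend on how many rows the IDs drawn into each set happen to have.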

Hey :) Thanks a lot for your reply! I get your approach and how it should work, but it doesn't work on my data. I have around 13,000 ID's and when I run your code, it splits my 280,000 observations according to the `spec` but the different sets still have 13,000 ID's each, so it is still not splitting my ID's according to the `spec`. Any idea how I can change this? — Fee, Aug 20 '17 at 23:48
Hey, I think I fixed it: I had been using my IDs as a factor variable, but I have now tried changing them to numeric and I think it worked. Does it make sense that this was the issue? — Fee, Aug 21 '17 at 00:18
@Fee the code I've posted here should work for any data type of the ID field. It would be helpful if you could share some actual data, e.g. the output of `dput(head(df))`, if you are having issues getting it to work for your data. — josliber, Aug 21 '17 at 01:18
Hey, yes, I tried it on a small sample set and it did work. The difference is that when the ID is a factor, it registers all the levels, while if it is an integer, it registers only the IDs that are present. So yes, the code works on either variable type, but it is just a bit messier when the ID is a factor. Many thanks for your help! :) — Fee, Aug 21 '17 at 01:33
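
The factor behaviour described in these comments can be reproduced with a small sketch. It assumes hypothetical data with the ID stored as a factor, and it swaps in `match()` for the row lookup, which is a substitution of mine and not part of the answer's code. After the split, each subset still carries every factor level, so counting levels overstates the number of IDs per set; counting the values actually present, or calling `droplevels()`, gives the real figure:

```r
# Hypothetical data: the ID column is a factor
df <- data.frame(
  Contact.ID = factor(rep(c("A", "B", "C", "D"), times = c(3, 4, 2, 1)))
)

set.seed(144)
spec <- c(train = .6, test = .2, validate = .2)
ids  <- unique(df$Contact.ID)
g    <- sample(cut(
  seq_along(ids),
  length(ids) * cumsum(c(0, spec)),
  labels = names(spec)
))

# match() maps each row to the position of its ID in `ids`,
# independent of how many levels the factor carries
res <- split(df, g[match(df$Contact.ID, ids)])

# Each subset keeps all four levels, so nlevels() overcounts
levels_before <- sapply(res, function(d) nlevels(d$Contact.ID))

# Counting the values actually present gives the real number of IDs per set
ids_present <- sapply(res, function(d) length(unique(d$Contact.ID)))

# droplevels() removes the unused levels from each subset
res <- lapply(res, droplevels)
levels_after <- sapply(res, function(d) nlevels(d$Contact.ID))

rbind(levels_before, ids_present, levels_after)
```

This is consistent with the thread: the split itself is correct for factor IDs, and only the level count is misleading until unused levels are dropped.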
