split data frame keep at least one factor in both subsets

Question

I have a data frame with 200k rows and 10 columns (only 3 are included here for ease of reading).

df=data.frame(A=rep(letters[1:20],10000),B=rep(letters[2:21],10000),C=rep(letters[3:22],10000))

I am trying to split the data into two subsets - training and testing.

s=sample(dim(df)[1],.6*dim(df)[1])
training=df[s,]
testing=df[-s,]

Is there a way to sample from df such that every factor level appears at least once in each of the resulting subsets? That is, for each of columns A-J, I want at least one instance of every level in both the training and testing sets.

I tried the approach from "Random subset containing at least one instance of each factor", but I cannot apply it to multiple columns, as opposed to the single one used in that example.

You may want to take a look at this question from yesterday: http://stackoverflow.com/questions/40353057/sampling-a-specific-age-distribution-from-a-dataset — Dave2e, Nov 01 '16 at 12:51
@Dave2e that example holds for a single variable, not for the 10 I am interested in splitting — frank, Nov 01 '16 at 13:11

score 0 · Answer 1 · edited May 23 '17 at 12:34

I think you are looking for stratified sampling. There are a few options for this in R, e.g. strata() in the sampling package. Check out this discussion.
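To make the idea concrete, here is a minimal base-R sketch of stratified sampling over several columns at once. It is not the sampling package's strata() function; it is an illustration that treats every observed combination of A, B and C as one stratum and draws roughly 60% of the rows from each. The smaller toy data frame, the seed, and the rule of holding back at least one row per stratum are assumptions made for the example.

```r
# Toy data in the shape of the question (smaller, for speed)
set.seed(1)
df <- data.frame(A = rep(letters[1:20], 100),
                 B = rep(letters[2:21], 100),
                 C = rep(letters[3:22], 100))

# One stratum per observed combination of A, B and C
strata_id <- interaction(df$A, df$B, df$C, drop = TRUE)

# Within each stratum, draw about 60% of the row indices, while
# leaving at least one row out whenever the stratum has 2+ rows
rows_by_stratum <- split(seq_len(nrow(df)), strata_id)
s <- unlist(lapply(rows_by_stratum, function(idx) {
  n <- length(idx)
  k <- min(max(1, round(0.6 * n)), max(1, n - 1))
  idx[sample.int(n, k)]
}), use.names = FALSE)

training <- df[s, ]
testing  <- df[-s, ]
```

A stratum with only a single row cannot appear in both subsets, so in this sketch such a row ends up in training only. With all 10 columns, combinations may become sparse, and that case would need to be handled separately.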

answered Nov 01 '16 at 12:54

metasequoia

score 0 · Answer 2 · answered Nov 01 '16 at 15:58

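library(caret)  # provides createDataPartition()
# stratify on the combination of A, B and C
s=createDataPartition(paste(df$A, df$B,df$C),p=.6,list=FALSE)
training=df[s,]
testing=df[-s,]

This worked for me.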
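Whichever way the split is produced, it may be worth checking that the requirement actually holds. The following is a small base-R sketch; covers_all_levels is a hypothetical helper name, and the toy data frame with its deliberately bad split exists only to show what a failure looks like.

```r
# Hypothetical helper: for each column, TRUE if every distinct value
# found in `full` also occurs in `part`
covers_all_levels <- function(full, part) {
  vapply(names(full), function(col) {
    all(unique(full[[col]]) %in% part[[col]])
  }, logical(1))
}

# Toy data and a deliberately bad split
toy <- data.frame(A = c("a", "a", "b", "b"),
                  B = c("x", "y", "x", "y"))
training <- toy[1:2, ]   # contains only A == "a"
testing  <- toy[3:4, ]   # contains only A == "b"

covers_all_levels(toy, training)   # A = FALSE, B = TRUE
covers_all_levels(toy, testing)    # A = FALSE, B = TRUE
```

On the real data, all(covers_all_levels(df, training)) and all(covers_all_levels(df, testing)) should both be TRUE if every value of every column is represented in both subsets.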

answered Nov 01 '16 at 15:58

frank
