I have a data frame with 200k rows and 10 columns (only 3 are included here for ease of reading).

df=data.frame(A=rep(letters[1:20],10000),B=rep(letters[2:21],10000),C=rep(letters[3:22],10000))

I am trying to split the data into two subsets - training and testing.

s=sample(dim(df)[1],.6*dim(df)[1])
training=df[s,]
testing=df[-s,]

Is there a way to sample from df such that each resulting subset contains at least one instance of every factor level? That is, for each of columns A-J, I want at least one instance of every level in both the training and testing sets.

I tried the approach from "Random subset containing at least one instance of each factor", but I cannot apply it to multiple columns; the example there uses only a single column.

frank
  • You may want to take a look at this question from yesterday: http://stackoverflow.com/questions/40353057/sampling-a-specific-age-distribution-from-a-dataset – Dave2e Nov 01 '16 at 12:51
  • @Dave2e that example holds for a single variable, not for the 10 I am interested in splitting – frank Nov 01 '16 at 13:11

2 Answers

I think you are looking for stratified sampling. There are a few ways to do this in R, e.g. the strata function in the sampling package. Check out this discussion.
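For illustration, the idea of stratified sampling can be sketched in base R without any extra package. This is only a minimal sketch, not the sampling package's method: it assumes the strata are the observed combinations of columns A, B and C from the question, that each combination occurs often enough to be divided, and the seed and the names strata_id and rows_by_stratum are made up for the example.

```r
# Sketch: stratified 60/40 split in base R, stratifying on the
# combination of columns A, B and C (an assumption for this example).
set.seed(1)  # arbitrary seed, only for reproducibility

df <- data.frame(A = rep(letters[1:20], 10000),
                 B = rep(letters[2:21], 10000),
                 C = rep(letters[3:22], 10000))

# One stratum per combination of A, B and C that actually occurs
strata_id <- interaction(df$A, df$B, df$C, drop = TRUE)

# Group the row indices by stratum, then draw 60% of each group
rows_by_stratum <- split(seq_len(nrow(df)), strata_id)
s <- unlist(lapply(rows_by_stratum, function(idx) {
  n_train <- round(0.6 * length(idx))
  idx[sample.int(length(idx), n_train)]
}), use.names = FALSE)

training <- df[s, ]
testing  <- df[-s, ]
```

Because every stratum contributes rows to both subsets, every value of A, B and C present in df shows up in training and in testing. A stratum with a single row could not be divided this way, so very rare combinations would need separate handling.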

metasequoia
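
Using createDataPartition from the caret package on the pasted combination of the columns worked:

library(caret)  # provides createDataPartition
s=createDataPartition(paste(df$A, df$B,df$C),p=.6,list=FALSE)
training=df[s,]
testing=df[-s,]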
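To confirm that a split keeps every value of every column in both subsets, a small check along these lines may help. It is a minimal sketch in base R: covers_all_levels is a hypothetical helper name, and toy, train_ok, test_ok and test_bad are made-up data for the demonstration.

```r
# Hypothetical helper: TRUE if every value of every column in `full`
# also appears in the same column of `subset_df`.
covers_all_levels <- function(full, subset_df) {
  all(vapply(names(full), function(col) {
    all(unique(full[[col]]) %in% subset_df[[col]])
  }, logical(1)))
}

# Made-up toy data: A cycles through a..d, B through b..e
toy <- data.frame(A = rep(letters[1:4], 5),
                  B = rep(letters[2:5], 5))

train_ok <- toy[1:12, ]           # contains all four values of A and B
test_ok  <- toy[13:20, ]          # contains all four values of A and B
test_bad <- toy[c(13, 14, 15), ]  # lacks "d" in A and "e" in B

covers_all_levels(toy, train_ok)  # TRUE
covers_all_levels(toy, test_ok)   # TRUE
covers_all_levels(toy, test_bad)  # FALSE
```

Applied to the real data, the same check would be covers_all_levels(df, training) and covers_all_levels(df, testing).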

frank