Split a data frame into two random samples with equal proportions of multiple variables

Question

I'm trying to run k-fold cross-validation for a glm whose predictors have unequally distributed factor levels. When I split the data into separate calibration/validation data frames, I inevitably end up with certain factor levels present in only one of the two.

Say I have the following data frame:

set.seed(3.14)
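df <- data.frame(x1 = sample(0:1, size = 100, replace = TRUE),
                 x2 = sample(0:2, size = 100, replace = TRUE),
                 y  = sample(0:1, size = 100, replace = TRUE))
# Convert each column to a factor in place. apply() would coerce the data
# frame to a character matrix, which R >= 4.0 keeps as character columns.
df[] <- lapply(df, as.factor)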
> sapply(df,FUN=summary)
$x1
0  1 
51 49 

$x2
0  1  2 
37 32 31 

$y
0  1 
48 52

How can I randomly split it into two data frames with roughly equal proportions of factor levels across all variables?

For example, the summary for an 80/20 split would look something like this:

Calibration:

$x1
0  1 
41 39 

$x2
0  1  2 
30 26 24 

$y
0  1 
38 42 

Validation:

Note: This is a simplified example. The actual data has 20+ variables with as many as 9 or 10 factor levels.

Also, if anyone knows of a better way to solve this problem, I'm open to suggestions.

If there are correlations between your variables, this could be difficult. How many rows of data do you have in your actual data set? — Gregor Thomas, Jul 27 '17 at 16:08
@Gregor There are 8,770 rows. It doesn't have to be a perfect split. I just want to make sure each variable is adequately represented in both calibration and validation data frames. — AffableAmbler, Jul 27 '17 at 16:11
That's enough data that an 80/20 split should be fairly representative. The risk would be if you have a few very uncommon levels. You can look at questions on [stratified random sampling](https://stackoverflow.com/q/23479512/903061), but controlling for 20+ variables can be difficult. Your best strategy might be to identify 1-3 variables with the least common factor levels and just pay attention to them. — Gregor Thomas, Jul 27 '17 at 16:27
One way would be to sort the data by the factor levels you are most concerned about and take every n-th observation. — Andrew Gustar, Jul 27 '17 at 16:43
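
The two suggestions in the comments above (stratifying on a few key variables, and sorting then taking every n-th row) could be sketched in base R roughly as follows. This is only an illustration, not a tested solution for the real 8,770-row data: the function names `stratified_split`, `systematic_split` and `levels_covered` are hypothetical, the random starting offset in the systematic version is an added assumption, and the toy data simply mirrors the example in the question.

```r
# Hypothetical helper: stratify on a few chosen variables (first suggestion).
# Each observed combination of levels in `strata_vars` is one stratum.
stratified_split <- function(data, strata_vars, p = 0.8) {
  strata <- interaction(data[strata_vars], drop = TRUE)
  rows_by_stratum <- split(seq_len(nrow(data)), strata)
  # Draw about p * n rows from every stratum for the calibration set
  cal <- unlist(lapply(rows_by_stratum, function(rows) {
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  }), use.names = FALSE)
  val <- setdiff(seq_len(nrow(data)), cal)
  list(calibration = data[cal, , drop = FALSE],
       validation  = data[val, , drop = FALSE])
}

# Hypothetical helper: sort by the variables of concern and send every
# n-th row to validation (second suggestion). n = 5 gives roughly 80/20.
# Assumption: the starting offset is random so repeated calls differ.
systematic_split <- function(data, sort_vars, n = 5) {
  # unname() so column names cannot clash with order()'s own arguments
  ord <- do.call(order, unname(as.list(data[sort_vars])))
  val <- ord[seq(from = sample.int(n, 1), to = nrow(data), by = n)]
  cal <- setdiff(seq_len(nrow(data)), val)
  list(calibration = data[cal, , drop = FALSE],
       validation  = data[val, , drop = FALSE])
}

# Hypothetical check: for each column, do both parts contain the same
# set of observed values?
levels_covered <- function(parts) {
  sapply(names(parts$calibration), function(v) {
    setequal(unique(parts$calibration[[v]]), unique(parts$validation[[v]]))
  })
}

# Toy data shaped like the example in the question
set.seed(42)
df <- data.frame(x1 = sample(0:1, size = 100, replace = TRUE),
                 x2 = sample(0:2, size = 100, replace = TRUE),
                 y  = sample(0:1, size = 100, replace = TRUE))
df[] <- lapply(df, as.factor)

parts_strat <- stratified_split(df, c("x1", "x2"), p = 0.8)
parts_sys   <- systematic_split(df, c("x1", "x2"), n = 5)

sapply(parts_strat, nrow)    # sizes of the two parts
levels_covered(parts_strat)  # which columns are represented in both parts
```

Only the variables passed as `strata_vars` or `sort_vars` are balanced by construction. With 20+ variables, stratifying on all of them at once would produce many tiny strata, so in practice one would probably pass only the few columns with the rarest levels and use something like `levels_covered()` to check the rest.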
