I'm trying to run k-fold cross-validation for a glm with unequally distributed factor levels. When I split the data into separate calibration and validation data frames, I inevitably end up with some factor levels present in only one of the two.

So say I have the following data frame:

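set.seed(3)  # set.seed() coerces its argument to an integer, so 3.14 acted as 3
df <- data.frame(x1 = sample(0:1, size = 100, replace = TRUE),
                 x2 = sample(0:2, size = 100, replace = TRUE),
                 y  = sample(0:1, size = 100, replace = TRUE))
# Convert each column to a factor. apply() returns a character matrix, which
# as.data.frame() turns into factors only when stringsAsFactors = TRUE
# (the default before R 4.0).
df[] <- lapply(df, factor)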
> sapply(df,FUN=summary)
$x1
0  1 
51 49 

$x2
0  1  2 
37 32 31 

$y
0  1 
48 52 

How can I randomly split it into two data frames with roughly equal proportions of factor levels across all variables?

For example, the summary for an 80/20 split would look something like this:

Calibration:

$x1
0   1
41  39
$x2
0    1   2
30   26  24
$y
0   1
38  42

Validation:

$x1
0   1
10  10
$x2
0   1  2
7   6  7
$y
0   1
10  10

Note: This is a simplified example. The actual data has 20+ variables with as many as 9 or 10 factor levels.

Also, if anyone knows of a better way to solve this problem, I'm open to suggestions.

Neil Lunn
AffableAmbler
  • If there are correlations between your variables, this could be difficult. How many rows of data do you have in your actual data set? – Gregor Thomas Jul 27 '17 at 16:08
  • @Gregor There are 8,770 rows. It doesn't have to be a perfect split. I just want to make sure each variable is adequately represented in both calibration and validation data frames. – AffableAmbler Jul 27 '17 at 16:11
  • That's enough data that an 80/20 split should be fairly representative. The risk would be if you have a few very uncommon levels. You can look at questions on [stratified random sampling](https://stackoverflow.com/q/23479512/903061), but controlling for 20+ variables can be difficult. Your best strategy might be to identify 1-3 variables with the least common factor levels and just pay attention to them. – Gregor Thomas Jul 27 '17 at 16:27
  • Good idea. Thanks! – AffableAmbler Jul 27 '17 at 16:29
  • One way would be to sort the data by the factor levels you are most concerned about and take every n-th observation. – Andrew Gustar Jul 27 '17 at 16:43
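
The two suggestions in the comments above (stratifying on a few variables with rare levels, or sorting by those variables and taking every n-th row) could be sketched in base R roughly as follows. This is only an illustration on the toy data from the question: `stratified_split()` and `systematic_split()` are hypothetical helper names made up here, not functions from any package, and the choice of `x1` and `x2` as the variables to balance is an assumption.

```r
# Toy data mirroring the question
set.seed(3)
df <- data.frame(x1 = sample(0:1, size = 100, replace = TRUE),
                 x2 = sample(0:2, size = 100, replace = TRUE),
                 y  = sample(0:1, size = 100, replace = TRUE))
df[] <- lapply(df, factor)

# Approach 1 (hypothetical helper): stratified random split.
# Each observed combination of levels in strata_vars is one stratum,
# and roughly a fraction p of every stratum goes to calibration.
stratified_split <- function(data, strata_vars, p = 0.8) {
  strata <- interaction(data[strata_vars], drop = TRUE)
  idx_by_stratum <- split(seq_len(nrow(data)), strata)
  calib_idx <- unlist(lapply(idx_by_stratum, function(idx) {
    n_cal <- round(p * length(idx))
    # Sample positions within idx so that a stratum of size 1 works too
    idx[sample.int(length(idx), n_cal)]
  }), use.names = FALSE)
  valid_idx <- setdiff(seq_len(nrow(data)), calib_idx)
  list(calibration = data[calib_idx, , drop = FALSE],
       validation  = data[valid_idx, , drop = FALSE])
}

# Approach 2 (hypothetical helper): sort by the variables of concern,
# then send every `every`-th row of the sorted data to validation.
systematic_split <- function(data, sort_vars, every = 5) {
  ord <- do.call(order, unname(as.list(data[sort_vars])))
  is_valid <- seq_along(ord) %% every == 0
  list(calibration = data[ord[!is_valid], , drop = FALSE],
       validation  = data[ord[is_valid], , drop = FALSE])
}

parts     <- stratified_split(df, c("x1", "x2"), p = 0.8)
sys_parts <- systematic_split(df, c("x1", "x2"), every = 5)

# Compare level counts in the two halves of each split
lapply(parts, function(d) lapply(d, summary))
lapply(sys_parts, function(d) lapply(d, summary))
```

With 20+ variables, stratifying on all of them at once would produce many tiny or empty strata, so in practice `strata_vars` would probably be limited to the one to three variables with the rarest levels, as suggested above. Because each stratum is rounded separately, the stratified split is only approximately 80/20 overall.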
