Is there a way to split the data into train and test such that all combinations of categorical predictors in the test data are present in the training data? If it is not possible to split the data given the proportions specified for the test and train sizes, then those levels should not be included in the test data.
Say I have data like this:
SAMPLE_DF <- data.frame("FACTOR1" = c(rep(letters[1:2], 8), "g", "g", "h", "i"),
"FACTOR2" = c(rep(letters[3:5], 2,), rep("z", 3), "f"),
"response" = rnorm(10,10,1),
"node" = c(rep(c(1,2),5)))
> SAMPLE_DF
FACTOR1 FACTOR2 response node
1 a c 10.334690 1
2 b d 11.467605 2
3 a e 8.935463 1
4 b c 10.253852 2
5 a d 11.067347 1
6 b e 10.548887 2
7 a z 10.066082 1
8 b z 10.887074 2
9 a z 8.802410 1
10 b f 9.319187 2
11 a c 10.334690 1
12 b d 11.467605 2
13 a e 8.935463 1
14 b c 10.253852 2
15 a d 11.067347 1
16 b e 10.548887 2
17 g z 10.066082 1
18 g z 10.887074 2
19 h z 8.802410 1
20 i f 9.319187 2
In the test data, if there were a combination of FACTOR 1 and 2 of a
c
then this would also be in the train data. The same goes for all other possible combinations.
createDataPartition does this for one level, but I would like it for all levels.