How to split into train and test data ensuring same combinations of factors are present in both train and test?

Question

Is there a way to split the data into train and test such that all combinations of categorical predictors in the test data are present in the training data? If it is not possible to split the data given the proportions specified for the test and train sizes, then those levels should not be included in the test data.

Say I have data like this:

SAMPLE_DF <- data.frame("FACTOR1" = c(rep(letters[1:2], 8), "g", "g", "h", "i"),
                        "FACTOR2" = c(rep(letters[3:5], 2), rep("z", 3), "f"),
                        "response" = rnorm(10,10,1),
                        "node" = c(rep(c(1,2),5)))
> SAMPLE_DF
   FACTOR1 FACTOR2  response node
1        a       c 10.334690    1
2        b       d 11.467605    2
3        a       e  8.935463    1
4        b       c 10.253852    2
5        a       d 11.067347    1
6        b       e 10.548887    2
7        a       z 10.066082    1
8        b       z 10.887074    2
9        a       z  8.802410    1
10       b       f  9.319187    2
11       a       c 10.334690    1
12       b       d 11.467605    2
13       a       e  8.935463    1
14       b       c 10.253852    2
15       a       d 11.067347    1
16       b       e 10.548887    2
17       g       z 10.066082    1
18       g       z 10.887074    2
19       h       z  8.802410    1
20       i       f  9.319187    2

If the test data contains a combination of FACTOR1 and FACTOR2 such as a and c, then that combination should also be in the train data. The same goes for all other combinations.

createDataPartition does this for a single factor, but I would like it for combinations of several factors.

Use `createDataPartition` (from whatever corner of the R-universe you found it) on either `interaction(FACTOR1, FACTOR2)` or on `paste(FACTOR1, FACTOR2, sep="_")` — IRTFM, Mar 22 '15 at 21:22
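A minimal sketch of the two keys mentioned in that comment, assuming the column names from the question. The toy values and the object names `key_paste` and `key_inter` are made up for illustration; the split itself would then be done on one of these keys.

```r
# Toy data with the question's column names (values are illustrative only)
df <- data.frame(FACTOR1 = c("a", "b", "a", "b", "g"),
                 FACTOR2 = c("c", "d", "c", "z", "z"),
                 stringsAsFactors = TRUE)

# Option 1: a character key, one value per row
key_paste <- paste(df$FACTOR1, df$FACTOR2, sep = "_")

# Option 2: a factor key; drop = TRUE discards combinations that never occur
key_inter <- interaction(df$FACTOR1, df$FACTOR2, drop = TRUE)

# Rows per observed combination; a count of 1 means that combination
# cannot be present in both train and test
table(key_paste)
```

Either key works as the stratification variable. The `paste` version is a plain character vector, while the `interaction` version is a factor whose levels are the observed combinations.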

LyzandeR · Answer 1 · 2015-03-23T00:13:45.110

You could try the following: use dplyr to remove the combinations that appear only once (these would necessarily end up in only one of the training or test sets), and then use createDataPartition to make the split:

Data

SAMPLE_DF <- data.frame("FACTOR1" = rep(letters[1:2], 10),
                         "FACTOR2" = c(rep(letters[3:5], 2), rep("z", 4)),
                         "num_pred" = rnorm(10,10,1),
                         "response" = rnorm(10,10,1))

Below, dplyr counts the number of rows for each combination of FACTOR1 and FACTOR2. Any combination with a count of 1 is filtered out:

library(dplyr)
mydf <-
  SAMPLE_DF %>%
  mutate(all = paste(FACTOR1, FACTOR2)) %>%
  group_by(all) %>%
  summarise(total = n()) %>%
  filter(total >= 2)

The above keeps only the combinations of FACTOR1 and FACTOR2 that appear at least twice.

Then keep only the rows of SAMPLE_DF whose combination survived the filter:

SAMPLE_DF2 <- SAMPLE_DF[paste(SAMPLE_DF$FACTOR1,SAMPLE_DF$FACTOR2) %in% mydf$all,]
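For anyone who prefers to avoid the dplyr dependency, the same filtering step can be sketched in base R. This is only an illustration on made-up toy data, and the names `key`, `n_per_row` and `df2` are hypothetical.

```r
# Toy data (illustrative values); the ("g", "z") combination occurs once
df <- data.frame(FACTOR1 = c("a", "b", "a", "b", "g"),
                 FACTOR2 = c("c", "d", "c", "d", "z"))

key <- paste(df$FACTOR1, df$FACTOR2)

# ave() returns, for every row, the size of the group that row belongs to
n_per_row <- ave(seq_along(key), key, FUN = length)

# Keep only rows whose combination appears at least twice
df2 <- df[n_per_row >= 2, ]
df2
```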

And finally you let createDataPartition do the split for you:

library(caret)
IND_TRAIN <- createDataPartition(paste(SAMPLE_DF2$FACTOR1, SAMPLE_DF2$FACTOR2))$Resample1

# train set
A <- SAMPLE_DF2[ IND_TRAIN, ]
# test set
B <- SAMPLE_DF2[-IND_TRAIN, ]
> identical(sort(paste(A$FACTOR1, A$FACTOR2)), sort(paste(B$FACTOR1, B$FACTOR2)))
[1] TRUE

As the identical line shows, the combinations are exactly the same in both sets!
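Note that comparing the sorted vectors with `identical` is stricter than the question requires, since it also demands that each combination occurs equally often in both sets. The question only asks that every combination in the test set is also in the train set, and that test rows are dropped when that is not possible. A rough base R sketch of that weaker requirement follows, using a plain (unstratified) random split on data shaped like the question's. The seed, the 75% proportion and the object names are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

df <- data.frame(
  FACTOR1 = c(rep(c("a", "b"), 8), "g", "g", "h", "i"),
  FACTOR2 = rep(c("c", "d", "e", "c", "d", "e", "z", "z", "z", "f"), 2)
)
key <- paste(df$FACTOR1, df$FACTOR2)

# Plain random split: 75% of the rows go to train
in_train <- seq_len(nrow(df)) %in% sample(nrow(df), floor(0.75 * nrow(df)))
train <- df[in_train, ]
test  <- df[!in_train, ]

# Drop test rows whose combination never occurs in train
test <- test[key[!in_train] %in% key[in_train], ]

# Check: no combination in test is missing from train
unseen <- setdiff(paste(test$FACTOR1, test$FACTOR2),
                  paste(train$FACTOR1, train$FACTOR2))
length(unseen) == 0
```

The check holds by construction, at the cost of a test set that may be smaller than the requested proportion.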

I've tweaked SAMPLE_DF. This solution did not work with the modified version. I think it may have only worked because it was possible to have the same combinations in both? Not sure. — goldisfine, Mar 22 '15 at 23:03
Sorry goldisfine and thanks for the comment. I had a typo in my `createDataPartition` function. I had written `SAMPLE_DF$FACTOR1` instead of `SAMPLE_DF2$FACTOR1`. I accidentally missed a 2 in there. Try again please and you will see it will work correctly. — LyzandeR, Mar 23 '15 at 00:15

chara · Answer 2 · 2021-11-25T22:00:55.530

This is a function I put together that splits a data frame into train and test sets (with a user-chosen training proportion) such that every factor column has the same levels present in both sets.

getTrainAndTestSamples_BalancedFactors <- function(data, percentTrain = 0.75, inSequence = FALSE, seed = 0){
    set.seed(seed) # set the seed so that the same split can be reproduced later
    train <- NULL
    test <- NULL

    factorData <- Filter(is.factor, data) # df containing only the factor columns
    # TRUE for each factor that has a level observed exactly once
    hasSingleton <- vapply(factorData, function(x) any(table(x) == 1), logical(1))

    if (any(hasSingleton))
        warning("This dataframe cannot be split so that every factor level appears in both sets. At least one factor has a level that occurs only once.")
    else {
        nTrain <- floor(percentTrain * nrow(data)) # number of rows in the train set
        # Repeat until all factor variables have the same observed levels in train and test.
        # Note: with random sampling this keeps drawing until a suitable split is found.
        repeat{
          if (inSequence)
            idx <- seq_len(nTrain)
          else
            idx <- sample.int(n = nrow(data), size = nTrain, replace = FALSE)

          train <- data[idx, , drop = FALSE]   # create train set
          test  <- data[-idx, , drop = FALSE]  # create test set

          # For each factor column, compare the levels actually observed in each set.
          # Subsetting keeps the full levels attribute, so names(summary(x)) or levels(x)
          # would always match; unique() on the values avoids that.
          sameLevels <- vapply(names(factorData), function(col) {
            setequal(unique(as.character(train[[col]])), unique(as.character(test[[col]])))
          }, logical(1))

          if (all(sameLevels))
            break # found a split where every factor has the same levels in both sets

          if (inSequence) { # a sequential split never changes, so retrying cannot help
            warning("The sequential split does not have the same factor levels in both sets.")
            break
          }
        }
    }

    return (list( "train" = train, "test" = test))
}

Use like:

df <- getTrainAndTestSamples_BalancedFactors(some_dataframe)
df$train # this is your train set
df$test # this is your test set

...and no more annoying errors from R!
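One detail worth keeping in mind when comparing factor levels between subsets: subsetting a factor keeps its full set of levels, so the levels that are actually observed have to be extracted explicitly. A small illustration on a made-up factor (the names `f` and `sub` are only for this example):

```r
f <- factor(c("a", "a", "b", "c"))
sub <- f[1:2]            # only "a" is observed in this subset

levels(sub)              # "a" "b" "c": unused levels are kept
names(summary(sub))      # "a" "b" "c": summary() lists them with zero counts
levels(droplevels(sub))  # "a": only the levels actually present
```

So a comparison between train and test should be based on `droplevels()` or on the unique values, not on `levels()` or `summary()` alone.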

This can definitely be improved, and I am looking forward to more efficient ways of doing this in the comments below. In the meantime, the code can be used as it is.

Enjoy!
