
I have a dataset, data, with 16 variables. One of them, DiseasePositive, indicates whether a person has tested positive for a disease, so its values are either 0 or 1.

What I want to do is as follows:

  1. Randomly select a subset of 70% of my data to train the model.
  2. Make sure that the train and test sets have approximately equal proportions of people with DiseasePositive==0 and people with DiseasePositive==1.

I read that I can use sample.split to do the 70% thing, but I don't know how to do the second thing. How can I do this using the sample.split function (from the caTools package)?

This is what I've done, but I'm not sure whether this is how the function works:

data$spl <- sample.split(data$DiseasePositive,SplitRatio = 0.7)
train    <- subset(data, data$spl==TRUE)
test     <- subset(data, data$spl==FALSE)
azura
  • Similar question and answer: https://stackoverflow.com/questions/57924068/how-to-get-around-error-factor-has-new-levels-in-cross-validation-glm/57937180#57937180 – Dave2e Oct 01 '19 at 19:57

2 Answers


Here is a custom-made R solution:

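stratified.sample <- function(var, p) {
  obs  <- seq_along(var)
  grps <- unique(var)
  inds <- integer()
  for (g in grps) {
    # positions of the observations that belong to group g
    members <- obs[var == g]
    # draw positions within 'members' rather than calling sample(members, ...),
    # because sample(x, ...) samples from 1:x when x is a single number
    picked  <- sample.int(length(members), floor(length(members) * p))
    inds    <- c(inds, members[picked])
  }
  inds
}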

You can use the above function to stratify into test and train for any variable, even if it has more than 2 levels. Here is a demonstration using iris:

tinds <- stratified.sample(iris$Species, 0.7)
train <- iris[tinds,]
test  <- iris[-tinds,]

Check that the class proportions were preserved:

table(train$Species)
table(test$Species)
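
Raw counts are hard to compare when the two sets differ in size, so it can help to look at proportions instead. The following is only a minimal sketch in base R: it redoes the stratified 70/30 split of iris with split() and lapply() so that it runs on its own, and the seed is an arbitrary choice made purely for reproducibility.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# row numbers of iris grouped by class
idx_by_class <- split(seq_len(nrow(iris)), iris$Species)

# take 70% of the rows within each class
train_idx <- unlist(
  lapply(idx_by_class,
         function(i) i[sample.int(length(i), floor(0.7 * length(i)))]),
  use.names = FALSE
)

train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# class proportions in each set; these should be (nearly) identical
train_props <- prop.table(table(train$Species))
test_props  <- prop.table(table(test$Species))
print(train_props)
print(test_props)
```

If the split is stratified, the two printed tables should agree up to rounding.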

Using sample.split and your data:

library(caTools)  # provides sample.split()
inds  <- sample.split(data$DiseasePositive, SplitRatio = 0.7)
train <- data[inds,]
test  <- data[!inds,]
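
For a 0/1 variable such as DiseasePositive, the mean of the column is the share of positives, which gives a quick way to check the split. The sketch below is an illustration under assumptions: the data frame is simulated (the real dataset is not available here), and the logical mask is built in base R as a stand-in with the same shape as a TRUE/FALSE split vector (TRUE = training row), so it does not call caTools.

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# simulated stand-in for the real dataset (hypothetical values)
data <- data.frame(x = rnorm(1000),
                   DiseasePositive = rbinom(1000, 1, 0.2))

# build a logical mask, TRUE = training row, choosing 70% within each class
inds <- logical(nrow(data))
for (g in unique(data$DiseasePositive)) {
  rows <- which(data$DiseasePositive == g)
  inds[rows[sample.int(length(rows), round(0.7 * length(rows)))]] <- TRUE
}

train <- data[inds, ]
test  <- data[!inds, ]

# for a 0/1 column the mean is the proportion of positives
train_rate <- mean(train$DiseasePositive)
test_rate  <- mean(test$DiseasePositive)
print(c(train = train_rate, test = test_rate))
```

The two rates should be close to each other; a large gap would suggest the split was not stratified on DiseasePositive.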
Karolis Koncevičius
  • what about using sample.split? – azura Oct 01 '19 at 19:52
  • also can you show me how to use this function using the information about this dataset as an example? I'm a bit lost – azura Oct 01 '19 at 19:54
  • The state of StackOverflow these days... The example you provided in your edit is correct. You can remove `data$spl==TRUE` part and replace it with only `data$spl`. Also to understand more about the function you are using you can read `help(sample.split)` – Karolis Koncevičius Oct 01 '19 at 19:58
  • @azura added an example with `sample.split` on the data you described. – Karolis Koncevičius Oct 01 '19 at 20:00
  • Oh really! So can you explain to me how it works? When I specify that variable in the function does it automatically split the data according to that variable? So basically does it split it into equal proportions of the variable==0 and ==1 in the training and test sets? I was worried it just sectioned it off without regard for the values – azura Oct 01 '19 at 20:00
  • @azura - The best way to be sure is - after you make the split, inspect `train` and `test` datasets and confirm yourself that the `DiseasePositive` has the same ratio of `1s` and `0s` in both parts. – Karolis Koncevičius Oct 01 '19 at 20:02
  • @azura - if you are using `sample.split` you don't need to use the function provided in my answer, use `sample.split` instead. They both do the same thing. – Karolis Koncevičius Oct 01 '19 at 20:03
  • How can I inspect to make sure the proportions are equal? – azura Oct 01 '19 at 20:06
  • Compare `table(train$DiseasePositive)` and `table(test$DiseasePositive)`. Or just look at the rows with something like `edit(train)`. – Karolis Koncevičius Oct 01 '19 at 20:07

Read this.

In short: use createDataPartition() from the caret package and convert the outcome to a factor first ("factorization"), like this:

# create data: 100,000 negatives and 100 positives
p <- 0.7
df <- data.frame(x = rnorm(100100),
                 y = c(rep(0, 100000), rep(1, 100)))

# create the partition; named idx so it does not mask base::sample()
idx <- caret::createDataPartition(as.factor(df$y), p = p, list = FALSE)

# create train/test sets
train <- df[idx, ]
test <- df[-idx, ]

# check train set
table(train$y)

# check without "factorization": a numeric y is binned by percentiles
# instead of being treated as two classes
table(df$y[caret::createDataPartition(df$y,
                                      p = p,
                                      list = FALSE)])

Output:

# table(train$y)
    0     1 
70000    70 

# table(df$y[caret::createDataPartition(df$y, 
# +                                     p = p, 
# +                                     list = F)])
    0     1 
69998    72 
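
The difference matters most for a rare class. As a rough illustration in base R only (it does not call caret, and the seed is arbitrary), the sketch below compares a plain random 70% sample with a per-class 70% sample on the same kind of 100,000-to-100 outcome. The per-class version always keeps 70 of the 100 positives, while the plain sample keeps a number that changes from run to run.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

y <- c(rep(0, 100000), rep(1, 100))

# plain random sample of 70% of all rows, ignoring the class
simple_idx <- sample.int(length(y), round(0.7 * length(y)))

# 70% sampled separately within each class
strat_idx <- unlist(
  lapply(split(seq_along(y), y),
         function(i) i[sample.int(length(i), round(0.7 * length(i)))]),
  use.names = FALSE
)

print(table(y[simple_idx]))  # number of 1s varies between runs
print(table(y[strat_idx]))   # class sizes are fixed by construction
```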
roma