I have a dataset `data` with 16 variables. One of the variables, `DiseasePositive`, indicates whether someone has tested positive for a disease. Its values are therefore either 0 or 1.
What I want to do is as follows:
- Randomly select a subset of 70% of my data to train the model.
- Make sure that the train and test sets have approximately equal proportions of people with `DiseasePositive==0` and people with `DiseasePositive==1`.
I read that I can use `sample.split` to do the 70% split, but I don't know how to do the second part. How can I do this using the `sample.split` function (from the `caTools` package)?
This is what I've tried, but I'm not sure whether this is how the function works:
data$spl <- sample.split(data$DiseasePositive,SplitRatio = 0.7)
train <- subset(data, data$spl==TRUE)
test <- subset(data, data$spl==FALSE)
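
One way to check empirically whether a split like this keeps the class balance is to compare the proportions of `DiseasePositive` in the two subsets. The caTools documentation describes `sample.split` as preserving the relative ratios of the labels in the vector it is given, so the proportions should come out close to each other. Below is a minimal sketch, assuming `caTools` is installed; the data frame `toy` and its column `x` are made up for illustration and stand in for the real dataset.

```r
library(caTools)  # provides sample.split()

set.seed(42)  # make the random split reproducible

# Hypothetical stand-in for the real dataset: 1000 rows, roughly 30% positives
toy <- data.frame(
  DiseasePositive = rbinom(1000, size = 1, prob = 0.3),
  x = rnorm(1000)
)

# Logical vector: TRUE marks rows assigned to the training set
spl <- sample.split(toy$DiseasePositive, SplitRatio = 0.7)
train <- toy[spl, ]
test  <- toy[!spl, ]

# Share of rows that went to training (should be close to 0.7)
nrow(train) / nrow(toy)

# Class proportions in each subset (should be close to each other)
prop.table(table(train$DiseasePositive))
prop.table(table(test$DiseasePositive))
```

Keeping the logical vector in a separate variable (`spl` here) rather than as a column of the data frame avoids carrying the split indicator into the model as an extra variable.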