I am building a model on HR attrition data. My target variable is Attrition (values Yes/No). My requirement is to take the Dev and Hold-Out samples as 70% of the population; some records in Hold-Out and Dev can overlap. I used the caTools library for the split, but it is not working as expected: the class counts after the split are identical to the counts before it. Please see the output below. Any quick help is really appreciated.
R code :
CTDF = read.table("HR_Employee_Attrition_Data.csv", sep = ",", header = T)
nrow(CTDF)
table(CTDF$Attrition)
library(caTools)
set.seed(100)
split = sample.split(CTDF$Attrition,SplitRatio=0.70)
CTDF.dev=subset(CTDF, split=TRUE)
table(CTDF$Attrition)
Output :
> nrow(CTDF)
[1] 2940
> str(CTDF)
$ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1
> table(CTDF$Attrition)
No Yes
2466 474
After split :
> table(CTDF$Attrition)
No Yes
2466 474
- Please comment on whether my approach to splitting the data into Dev and Hold-Out is correct. Note that I don't need a separate test sample.
- How should I get sample.split to work in this scenario?
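
For reference, here is a minimal sketch of what the split might look like once two likely problems in the code above are addressed: `subset(CTDF, split=TRUE)` uses `=` rather than `==`, so no rows are filtered out, and the final `table()` call tabulates the original `CTDF` rather than `CTDF.dev`. The data frame below is a hypothetical stand-in with the same class counts as the output above (with the real file, the `read.table()` call would stay as it is), and `CTDF.holdout` is a name introduced only for illustration.

```r
library(caTools)

# Hypothetical stand-in with the same class counts as the output in the post;
# with the real CSV, keep the read.table() call instead of this block.
CTDF <- data.frame(
  Attrition = factor(rep(c("No", "Yes"), times = c(2466, 474)))
)

set.seed(100)
# sample.split() returns a logical vector with one TRUE/FALSE per row
split <- sample.split(CTDF$Attrition, SplitRatio = 0.70)

# '==' makes this a row filter; 'split=TRUE' would instead be passed as an
# unused named argument, so subset() would return every row
CTDF.dev     <- subset(CTDF, split == TRUE)
CTDF.holdout <- subset(CTDF, split == FALSE)  # illustrative name

# Tabulate the subsets, not the original CTDF
table(CTDF.dev$Attrition)
table(CTDF.holdout$Attrition)
```

Note that this produces two non-overlapping pieces (roughly 70% and 30%). If Dev and Hold-Out are each meant to be a 70% draw that may overlap, one option would be to call `sample.split` twice with different seeds, but that is an assumption about the requirement rather than something the post settles.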