
I am building a model for HR attrition data. My target variable is Attrition, which takes the values Yes/No. My requirement is that the development (Dev) sample and the hold-out sample should each be 70% of the population, so some records can appear in both. I used the caTools library for the split and found that it is not working as expected: the class counts after the split are identical to those of the full data set. Please see the output below. Any quick help is appreciated.

R code :

 CTDF = read.table("HR_Employee_Attrition_Data.csv", sep = ",", header = T)
 nrow(CTDF)
 table(CTDF $Attrition)
 library(caTools)
 set.seed(100)
 split = sample.split(CTDF$Attrition,SplitRatio=0.70)
 CTDF.dev=subset(CTDF, split=TRUE) 
 table(CTDF$Attrition)

Output :

> nrow(CTDF)
 [1] 2940
> str(CTDF)
 $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 
> table(CTDF$Attrition) 
 No  Yes 
 2466  474 

After split :

> table(CTDF$Attrition)

  No  Yes 
 2466  474 
  1. Is my approach to splitting the data into Dev and hold-out samples correct? Note that I don't need a separate test sample.
  2. How can I make sample.split work in this scenario?
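
For reference, a minimal sketch of what most likely goes wrong in the code above and how it could be corrected. In `subset(CTDF, split=TRUE)`, the single `=` passes `split` as a named argument instead of testing the logical vector, so no rows are filtered out; the final `table()` call also tabulates `CTDF` rather than `CTDF.dev`. The data frame below is a synthetic stand-in (an assumption, since the CSV is not available here) that reproduces only the class counts reported in the output.

```r
library(caTools)

# Synthetic stand-in for HR_Employee_Attrition_Data.csv (assumption):
# only the Attrition column, with the class counts shown in the question.
CTDF <- data.frame(
  Attrition = factor(rep(c("No", "Yes"), times = c(2466, 474)))
)

set.seed(100)
split <- sample.split(CTDF$Attrition, SplitRatio = 0.70)

# `==` tests the logical vector row by row. Writing `split = TRUE` instead
# is taken as a named argument, and subset() then keeps every row.
CTDF.dev <- subset(CTDF, split == TRUE)

# Tabulate the subset, not the original data frame.
table(CTDF.dev$Attrition)
nrow(CTDF.dev) / nrow(CTDF)  # should be close to 0.70
```

Because `sample.split` is given the class labels, it keeps roughly the same Yes/No ratio in the Dev sample as in the full data.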
  • Your output and after split are the same... A variety of methods listed [here](http://stackoverflow.com/questions/17200114/how-to-split-data-into-training-testing-sets-using-sample-function-in-r-program) – Jean Feb 14 '17 at 01:32
  • Hi, thanks for your response. It works now. However, I would still like to know how to get a hold-out sample that overlaps with some of the entries in the development sample. I appreciate your help. – Venugopal Sandepudi Feb 14 '17 at 05:17
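
One possible way to get an overlapping hold-out sample, sketched under assumptions: call `sample.split` a second time and take the hold-out from that independent draw. Two 70% samples of the same population must share at least 40% of the rows (0.7 + 0.7 - 1), so overlap is guaranteed. The data frame is again a synthetic stand-in for the CSV, and the `id` column and the names `in_dev` and `in_holdout` are hypothetical, introduced only for illustration.

```r
library(caTools)

# Synthetic stand-in for the real data (assumption): a hypothetical row id
# plus Attrition with the class counts shown in the question.
CTDF <- data.frame(
  id = seq_len(2940),
  Attrition = factor(rep(c("No", "Yes"), times = c(2466, 474)))
)

set.seed(100)
in_dev     <- sample.split(CTDF$Attrition, SplitRatio = 0.70)
in_holdout <- sample.split(CTDF$Attrition, SplitRatio = 0.70)  # second, independent draw

CTDF.dev     <- CTDF[in_dev, ]
CTDF.holdout <- CTDF[in_holdout, ]

# Rows that appear in both samples.
overlap <- sum(in_dev & in_holdout)
overlap / nrow(CTDF)  # at least 0.40 by construction

table(CTDF.dev$Attrition)
table(CTDF.holdout$Attrition)
```

Rows shared with the Dev sample are not unseen data, so performance measured on this hold-out sample will tend to look more optimistic than it would on a disjoint one.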
