6

I am building a logistic regression model in R. I want to bin continuous predictors in an optimal way with respect to the target variable. There are two approaches that I know of:

  1. the continuous variables are binned such that their IV (information value) is maximized

  2. the bins are chosen to maximize the chi-square statistic of the two-way contingency table between the target (which takes the values 0 and 1) and the binned buckets of the continuous variable

Does anyone know of any functions in R that can perform such binning?

Your help will be greatly appreciated.

Matthew Murdoch
Michael
    Before you bin a continuous variable, read Frank Harrell's objections: http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous – Richie Cotton Aug 11 '11 at 09:56
  • Still I think this is one way of doing it. There are advantages and disadvantages. Modeling as continuous variables can also have drawbacks. – Michael Aug 12 '11 at 00:47
    It is a bit of archaeology, but could you @Michael elaborate a bit on drawbacks of using continuous variables as continuous variables? Or share anything you have learnt through those 3 years since the comment above. – m-dz Jul 22 '16 at 19:26

3 Answers

5

For the first point, you could bin using the weight of evidence (WoE) with the package woeBinning, which optimizes the number of bins for the IV:

library(woeBinning)

# get the bin cut points from your dataframe
cutpoints <- woe.binning(dataset, "target_name", "Variable_name")
woe.binning.plot(cutpoints)

# apply the cutpoints to your dataframe
dataset_woe <- woe.binning.deploy(dataset, cutpoints, add.woe.or.dum.var = "woe")

It returns your dataset with two extra columns:

  • Variable_name.binned, which contains the bin labels
  • Variable_name.woe.binned, which contains the WoE replacement values that you can then pass into your regression instead of Variable_name (see the sketch below)
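
A minimal usage sketch (not part of the original answer, reusing the placeholder names above: target_name is the target column and dataset_woe the deployed data frame):

# sketch only: regress the target on the WoE-encoded predictor
model <- glm(target_name ~ Variable_name.woe.binned,
             data = dataset_woe, family = binomial)
summary(model)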

For the second point, on the chi-square criterion, the package discretization seems to handle it, but I haven't tested it (untested sketch below).
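
An untested sketch of that route, assuming the ChiMerge implementation discretization::chiM() (it expects a data frame whose last column is the class; x1, x2 and target_name are placeholder names):

library(discretization)
# chiM() merges adjacent intervals using a chi-square test;
# the last column of the input must be the target
chi_input <- data.frame(x1 = dataset$x1, x2 = dataset$x2,
                        target = dataset$target_name)
chi_res <- chiM(chi_input, alpha = 0.05)
chi_res$cutp              # chi-square-based cut points, one vector per predictor
head(chi_res$Disc.data)   # discretized predictors plus the target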

R. Prost
3

The methods used by regression splines to set knot locations might be considered. The rpart package probably has relevant code (a hedged sketch of that idea follows the quantile example below). You do need to penalize the inferential statistics, because this approach implicitly hides the degrees of freedom spent moving the breaks around to get the best fit. Another common method is to specify breaks at equally spaced quantiles (quartiles or quintiles) within the subset with IV=1. Something like this untested code:

cont.var.vec <- c("x1", "x2")   # names of all your continuous variables (placeholders)
# quantile-based break points for one numeric vector
breaks <- function(x, n) quantile(x,
                                  probs = seq(0, 1, length.out = n),
                                  na.rm = TRUE)
lapply(dfrm[dfrm$IV == 1, cont.var.vec], breaks, n = 5)
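
And a hedged sketch of the rpart idea mentioned above (placeholder names y, x1 and dfrm; not tested): let a shallow classification tree pick the split points for one predictor and reuse them as bin boundaries.

library(rpart)
tree <- rpart(y ~ x1, data = dfrm, method = "class",
              control = rpart.control(minbucket = 30, cp = 0.001))
cuts <- sort(unique(tree$splits[, "index"]))        # split values chosen by the tree
dfrm$x1.binned <- cut(dfrm$x1, breaks = c(-Inf, cuts, Inf))
table(dfrm$y, dfrm$x1.binned)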
IRTFM
  • Could you explain your code? If I have a data frame df in which y is binary, x1 and x2 are continuous predictors, and x3 is categorical, how do I apply your code to bin x1 and x2? – Michael Aug 11 '11 at 22:36
  • when you say "specify breaks at equally spaced quantiles (quartiles or quintiles) within the subset with IV=1", do you mean that x1 and x2 are binned into four quantiles using the rows where y=1? – Michael Aug 11 '11 at 22:39
  • Sorry about the multiple messages, I hit return and it just went out. Thanks – Michael Aug 11 '11 at 22:40
  • I was implicitly assuming that "1" was the target category with the lower count. The breaks are set to give roughly n - 1 (here 4) equal-sized groups, and the same breaks are then used in the larger target group. – IRTFM Aug 12 '11 at 07:07
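
(A hedged illustration of that comment, with y, x1 and dfrm as placeholder names: compute quantile breaks on the smaller target class only, then cut the whole column with them.)

brks <- quantile(dfrm$x1[dfrm$y == 1],
                 probs = seq(0, 1, length.out = 5), na.rm = TRUE)
dfrm$x1.binned <- cut(dfrm$x1, breaks = unique(c(-Inf, brks, Inf)))
table(dfrm$y, dfrm$x1.binned)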
-4

setwd("D:")   # set working directory (adjust to your own path)
rm(list=ls())
options (scipen = 999)
dt <- read.csv("dummy_data.txt")   # load the example data

head(dt)
summary(dt)
mydata <- dt
head(mydata)
summary(mydata)
## Cap numeric columns at 1.5 * Q3 + 1 to dampen extreme outliers
for (i in seq_len(ncol(mydata))) {
  if (is.numeric(mydata[, i])) {
    val.quant <- unname(quantile(mydata[, i], probs = 0.75))
    mydata[, i] <- pmin(mydata[, i], 1.5 * val.quant + 1)
  }
}

## Initial random forest to gauge variable importance
library(randomForest)
x <- mydata[,!names(mydata) %in% c("Cust_Key","Y")]
y <- as.factor(mydata$Y)

set.seed(21)
fit <- randomForest(x,y,importance=T,ntree = 70)

mydata2 <- mydata[,!names(mydata) %in% c("Cust_Key")]
mydata2$Y <- as.factor(mydata2$Y)
fit$importance
#### Variable reduction: iteratively drop low-importance variables
#### until 20 predictors remain
vartoremove <- ncol(mydata2) - 20
library(rminer)
for (i in 1:vartoremove) {
  rf <- fit(Y ~ ., data = mydata2, model = "randomForest", mtry = 10, ntree = 100)
  varImportance <- Importance(rf, mydata2, method = "sensg")   # sensitivity-based importance
  Z <- order(varImportance$imp, decreasing = FALSE)            # least important first
  IND <- Z[2]
  var_to_remove <- names(mydata2[IND])
  mydata2[IND] <- NULL                                         # drop that variable
  print(i)
}
###########
## IV-optimal binning with smbinning (needs a numeric 0/1 target)
library(smbinning)
inp <- as.data.frame(mydata2)
summary(inp)
str(inp)
inp$target <- as.numeric(inp$Y) - 1   # factor levels 1/2 -> 0/1
table(inp$target)
ftable(inp$Y, inp$target)
result <- smbinning(df = inp, y = "target", x = "X37", p = 0.0005)
result$ivtable
smbinning.plot(result, option = "badrate", sub = "test")
boxplot(inp$X2 ~ inp$Y, horizontal = TRUE, frame = FALSE, col = "red",
        main = "Distribution")
### 70/30 train/test split
require(caTools)
inp$Y <- NULL                      # keep only the numeric target column
sample <- sample.split(inp$target, SplitRatio = 0.7)
train <- subset(inp, sample == TRUE)
test  <- subset(inp, sample == FALSE)
head(train)
nrow(train)

fit1 <- glm(target ~ ., data = train, family = binomial)   # logistic regression on the remaining predictors

summary(fit1)
prediction1 <- data.frame(actual = test$target, predicted = predict(fit1,test ,type="response") )

result= smbinning(df=prediction1, y="actual",  x="predicted", p=0.005) 
result$ivtable

smbinning.plot(result,option="badrate",sub="test")

tail(prediction1)

write.csv(prediction1 , "test_pred_logistic.csv")
predict_train <- data.frame(actual = train$target, predicted = predict(fit1,train ,type="response") )
write.csv(predict_train , "train_pred_logistic.csv")
result= smbinning(df=predict_train, y="actual",  x="predicted", p=0.005) 
result$ivtable
smbinning.plot(result,option="badrate",sub="train")


####random forest

rf <- fit(target ~ ., data = train, model = "randomForest", mtry = 10, ntree = 200)

prediction2 <- data.frame(actual = test$target, predicted = predict(rf, test))   # score the test set
result <- smbinning(df = prediction2, y = "actual", x = "predicted", p = 0.005)
result$ivtable
smbinning.plot(result, option = "badrate", sub = "test")

###########IV

library(devtools)
install_github("tomasgreif/riv")   # install the IV/WoE package from GitHub
library(woe)

##### K-fold Validation ########

library(caret)
cv_fold_count <- 2
folds <- createFolds(mydata2$Y, cv_fold_count, list = TRUE)

smpl <- folds[[1]]   # indices of the first hold-out fold; loop over folds for full CV
g_train <- mydata2[-smpl, !names(mydata2) %in% c("Y")]
g_test  <- mydata2[smpl,  !names(mydata2) %in% c("Y")]

cost_train <- mydata2[-smpl, "Y"]
cost_test  <- mydata2[smpl,  "Y"]

rf <- randomForest(g_train, cost_train)              # random forest on the training fold
logit.data <- cbind(cost_train, g_train)
logit.fit <- glm(cost_train ~ ., data = logit.data, family = binomial)   # logistic fit on the same fold

prediction <- data.frame(actual = cost_test,           # held-out fold outcomes
                         predicted = predict(rf, g_test))

MUMGO
    Welcome to SO! You need to provide more explanation to your answers. See [How to answer page](http://stackoverflow.com/help/how-to-answer) for help in improving your answer. – Madness Aug 08 '15 at 20:38