I am working through an example from http://r-statistics.co/Logistic-Regression-With-R.html and have a problem with the smbinning code. I am trying to compute the Information Value (IV) using smbinning.

library(smbinning)
# segregate continuous and factor variables
factor_vars <- c ("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION", "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY")
continuous_vars <- c("AGE", "FNLWGT","EDUCATIONNUM", "HOURSPERWEEK", "CAPITALGAIN", "CAPITALLOSS")

iv_df <- data.frame(VARS=c(factor_vars, continuous_vars), IV=numeric(14))  # init for IV results

# compute IV for categoricals
for(factor_var in factor_vars){
  smb <- smbinning.factor(trainingData, y="ABOVE50K", x=factor_var)  # WOE table
  if(class(smb) != "character"){ # check if some error occurred
    iv_df[iv_df$VARS == factor_var, "IV"] <- smb$iv
  }
}

This is the code given. I do not understand why it checks the class of the smbinning result. My general understanding of smbinning is also not that good.

for(vars in factor_vars){
 smb <- smbinning.factor(trainingData, y = "ABOVE50K", x = vars )
 iv_df[iv_df$VARS == vars, "IV"] <- smb["iv"]
}

When I run this code, I get NA values for some variables. So the class check is apparently needed, but why?

Thank you very much.

boyaronur

1 Answer

Following the example to the letter, the problem is the following:

  1. If you run smb <- smbinning.factor(trainingData, y="ABOVE50K", x="EDUCATION") and then print smb, you get

[1] "Too many categories"

     so smb is a character string rather than a list with an iv element. That is the case the class check guards against.

  2. str(trainingData) shows that:

$ EDUCATION : Factor w/ 16 levels...

  3. The smbinning documentation says that

maxcat - Specifies the maximum number of categories. Default value is 10. Name of x must not have a dot.

  4. Therefore your solution is to use: smb <- smbinning.factor(trainingData, y="ABOVE50K", x=factor_var, maxcat=16) in the for loop
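
To see where the NA values come from without needing the package, here is a minimal base R sketch. It assumes, based on the output shown in step 1, that a failed smbinning.factor() call returns a plain character string, and it uses stand-in objects instead of real smbinning results. The names smb_failed, smb_ok and extract_iv are hypothetical, and the IV value 0.5 is made up for illustration.

```r
# Stand-in for a failed smbinning.factor() call: a plain character
# string instead of a result list (assumption based on the output above).
smb_failed <- "Too many categories"

# Stand-in for a successful call: a list with an `iv` element.
# The value 0.5 is invented purely for this illustration.
smb_ok <- list(iv = 0.5)

# Indexing an unnamed character vector by name gives NA. This is what
# smb["iv"] does in the loop without the class check.
na_result <- smb_failed["iv"]
print(is.na(na_result))

# Hypothetical helper: return the IV when binning worked, NA otherwise.
extract_iv <- function(smb) {
  if (is.character(smb)) NA_real_ else smb$iv
}

print(extract_iv(smb_ok))
print(extract_iv(smb_failed))
```

In the original loop the check is written as class(smb) != "character"; is.character(smb) is an equivalent test here. Either way the check only skips the variables that failed, so their IV stays at its initial value. Raising maxcat is what actually lets those variables be binned.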
Borislav Aymaliev