
I am new to R and learning it part time, so I apologize in advance for naive code and questions. I have spent about a day trying to figure this out without success, hence asking here.

https://www.kaggle.com/c/titanic/data?select=train.csv

I am working with the Titanic training data set from Kaggle, imported as train_data. I have cleaned up all the columns and converted them to factors where needed.

I have three questions:

1. I cannot understand why this code gives an IV of 0 for every variable. What have I done wrong?

factor_vars <- colnames(train_data)
all_iv <- data.frame(VARS = factor_vars,
                     IV = numeric(length(factor_vars)),
                     STRENGTH = character(length(factor_vars)),
                     stringsAsFactors = FALSE)
for (factor_var in factor_vars) {
  all_iv[all_iv$VARS == factor_var, "IV"] <-
    InformationValue::IV(X = train_data[, factor_var], Y = train_data$Survived)
  all_iv[all_iv$VARS == factor_var, "STRENGTH"] <-
    attr(InformationValue::IV(X = train_data[, factor_var], Y = train_data$Survived), "howgood")
}

all_iv <- all_iv[order(-all_iv$IV), ]

2. I am trying to write my own function to calculate IV values for multiple columns in one go, so that I do not have to repeat the same task. However, when I run the following code, I get the total count of 0s and the total count of 1s in every row instead of counts per group. What am I doing wrong in this example?

train_data %>%
  group_by(train_data[[3]]) %>%
  summarise(zero = sum(train_data[[2]] == 0),
            one = sum(train_data[[2]] == 1))

I get this output:

     zero   one
1     549   342
2     549   342
3     549   342

whereas I would expect a result like:

     zero   one
1      80   136
2      97    87
3     372   119

What is wrong with my code?

3. Is there a prebuilt function that gives IV values for all columns? While searching I found the iv.mult function, but I cannot get it to work. Any suggestions would be great.

Martin Gal
  • (1) Please take a look at how to make a [great reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). (2) What do you expect `IV` to be? – Martin Gal Sep 29 '21 at 17:52
  • @MartinGal: Thank you for your suggestion on the reproducible example. I have gone through it, and next time I will try to follow it as much as possible. In this case, however, I did not know how to attach the CSV data file. – Gaurav Arora Sep 30 '21 at 09:12

1 Answer

Let's take a look at your questions:

1.

length(factor_vars)
#> [1] 12

`length()` returns the number of elements in your vector `factor_vars`, so `numeric(length(factor_vars))` evaluates to `numeric(12)`, which returns a numeric vector of length 12 filled with zeros by default. These zeros are the initial values of the IV column.

The same applies to `character(length(factor_vars))`, which returns a character vector of length 12 filled with empty strings "".
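As a small illustration of those defaults, here is a sketch with a made-up three-row table (the variable names "a", "b", "c" and the value 0.5 are placeholders, not Titanic results). It shows that a row keeps its initial 0 unless an assignment actually replaces it:

```r
# Hypothetical variable names, used only to illustrate the defaults
vars <- c("a", "b", "c")

numeric(length(vars))    # numeric vector of three zeros
character(length(vars))  # character vector of three empty strings

all_iv <- data.frame(VARS = vars,
                     IV = numeric(length(vars)),
                     STRENGTH = character(length(vars)),
                     stringsAsFactors = FALSE)

# Only the row that is assigned to changes; 0.5 is an arbitrary placeholder
all_iv[all_iv$VARS == "b", "IV"] <- 0.5
all_iv
```

So if the IV column is still all zeros after the loop, it is worth checking what each call inside the loop returns for a single column.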

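2. Your code doesn't use correct dplyr syntax. Inside `group_by()` and `summarise()`, refer to columns by their bare names. `train_data[[2]]` always refers to the whole column of the ungrouped data frame, so every group receives the same totals (549 and 342).

library(dplyr)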

train_data %>% 
  group_by(Pclass) %>%
  summarise(zero = sum(Survived == 0),
            one = sum(Survived == 1))

returns

# A tibble: 3 x 3
  Pclass  zero   one
   <dbl> <int> <int>
1      1    80   136
2      2    97    87
3      3   372   119

which is most likely what you are looking for.
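To see the difference in isolation, here is a minimal sketch on a made-up five-row data frame (`toy`, `g` and `y` are hypothetical names, not columns of the Titanic data). It assumes dplyr is installed:

```r
library(dplyr)

# Made-up data: g plays the role of Pclass, y the role of Survived
toy <- data.frame(g = c(1, 1, 2, 2, 3),
                  y = c(0, 1, 1, 1, 0))

# toy[[2]] is the full, ungrouped column, so each group gets the overall totals
wrong <- toy %>%
  group_by(g) %>%
  summarise(zero = sum(toy[[2]] == 0),
            one  = sum(toy[[2]] == 1))

# The bare name y is evaluated within each group
right <- toy %>%
  group_by(g) %>%
  summarise(zero = sum(y == 0),
            one  = sum(y == 1))

wrong
right
```

In `wrong`, every row repeats the same pair of totals, which mirrors the 549/342 pattern in the question; in `right`, the counts differ per group.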

3. I don't know what IV means.
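If IV here means information value in the weight-of-evidence sense, i.e. the sum over levels of (share of 1s minus share of 0s) times the natural log of their ratio (an assumption on my part), then a base R sketch along the following lines could compute it for several columns at once. `iv_one` is a hypothetical helper, not a function from any package, and skipping levels with a zero count is a simplification that packages may handle differently. The `Pclass` counts are taken from the table above; the `Coin` column is made up purely to have a second predictor.

```r
# Hypothetical helper: information value of one categorical predictor x
# against a binary outcome y coded 0/1
iv_one <- function(x, y) {
  tab  <- table(x, y)                    # rows: levels of x, columns: "0" and "1"
  pct0 <- tab[, "0"] / sum(tab[, "0"])   # share of all 0s falling in each level
  pct1 <- tab[, "1"] / sum(tab[, "1"])   # share of all 1s falling in each level
  keep <- pct0 > 0 & pct1 > 0            # assumption: skip levels where the log is undefined
  sum((pct1[keep] - pct0[keep]) * log(pct1[keep] / pct0[keep]))
}

# Toy data rebuilt from the Pclass counts above (80/136, 97/87, 372/119)
toy <- data.frame(
  Pclass   = rep(c(1, 2, 3, 1, 2, 3), times = c(80, 97, 372, 136, 87, 119)),
  Survived = rep(c(0, 1), times = c(549, 342))
)
toy$Coin <- rep(c("H", "T"), length.out = nrow(toy))  # made-up, uninformative column

# Apply the helper to every column except the outcome
predictors <- setdiff(names(toy), "Survived")
ivs <- sapply(toy[predictors], iv_one, y = toy$Survived)

all_iv <- data.frame(VARS = predictors, IV = unname(ivs))
all_iv <- all_iv[order(-all_iv$IV), ]
all_iv
```

The same `sapply()` pattern should work with any function that takes one predictor and the outcome, so it can wrap a package function instead of the hand-written helper.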
Martin Gal
  • Thank you for taking the time to guide me on my query :) IV means Information Value. I am currently using library(InformationValue) and its IV command. However, that only lets me assess IV for one independent variable at a time, and I do not want to type in code for 15 independent variables. Hence I am trying to find a library or command that would let me create a table of all independent variables against the dependent variable in one go. – Gaurav Arora Sep 30 '21 at 09:15