I have several categorical variables in my dataset, each with more than two levels. I want an R function (or loop) that calculates the entropy and information gain for each level of each categorical variable and returns the lowest entropy and the highest information gain.

data <- list(
  buys = c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"),
  credit = c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"),
  student = c("no", "no", "no", "no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes", "no"),
  income = c("high", "high", "high", "medium", "low", "low", "low", "medium", "low", "medium", "medium", "medium", "high", "medium"),
  age = c(25, 27, 35, 41, 48, 42, 36, 29, 26, 45, 23, 33, 37, 44)
)
data <- as.data.frame(data)

Above is a sample dataframe

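entropy_tab <- function(x) {
  # Row-wise class proportions of the target within each level of x;
  # the target is assumed to be `buys` in the sample data above.
  # The small constant avoids log2(0) for empty cells.
  tabfun2 <- prop.table(table(data[, x], data$buys) + 1e-6, margin = 1)
  # Weight each level's entropy by the share of rows at that level
  sum(prop.table(table(data[, x])) * rowSums(-tabfun2 * log2(tabfun2)))
}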

The function above calculates the entropy for each variable. I want a function that calculates the contribution of each level to that entropy, i.e. the contribution of "excellent" and "fair" to the entropy of "credit".

highclef
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Dec 12 '22 at 17:58
  • It's not clear what you're asking that the linked question doesn't already answer. That question uses categorical variables. – James_D Dec 12 '22 at 18:06
  • please could you check this link? https://stackoverflow.com/questions/32993750/to-calculate-the-entropy In my case, i want the entropy of "high", "medium", "low" under the variable "income" – highclef Dec 12 '22 at 18:07
  • @James_D thank you, yes but what I want is that I get the entropy of each levels under one categorical variable, e.g entropy of "high", "medium", "low" under the categorical variable "Income" – highclef Dec 12 '22 at 18:08
  • please share your data with `dput(yourdata)` and code you have tried. I realized you linked to other questions but this information will be helpful. – Mike Dec 12 '22 at 18:09
  • The entropy is `sum(-p*log(p))` where `p` is the proportion of the data set taking on each value (level), and the sum is over all levels. So I guess you mean the contribution to the entropy for each level? As suggested, edit your question to include a sample data set and the output you expect to get from that data set. – James_D Dec 12 '22 at 18:18
  • Thank you @James_D I just did that. – highclef Dec 12 '22 at 18:37

2 Answers

In measure-theoretic terms, the expected surprisal of an event A in a probability space with probability measure mu is

-mu(A) log(mu(A))

The entropy is then the sum of the expected surprisal over the disjoint events that partition the space, which here are the levels of a variable. So what you're looking for is the expected surprisal of each level of each variable.
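As a worked illustration of that definition (a sketch using the natural logarithm and the `credit` column of the sample data, where 6 of the 14 rows are "excellent" and 8 are "fair"; the symbols p_i and H are notation introduced here, not taken from the question):

```latex
H(X) = \sum_{i} -p_i \log p_i, \qquad p_i = \frac{n_i}{n}

-\tfrac{6}{14}\ln\tfrac{6}{14} \approx 0.3631, \qquad
-\tfrac{8}{14}\ln\tfrac{8}{14} \approx 0.3198, \qquad
H(\text{credit}) \approx 0.3631 + 0.3198 = 0.6829
```

Each term is the contribution of one level, and the terms add up to the entropy of the variable.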

Note you won't be able to express the surprisal of a data frame as a data frame, as each variable in the data frame has a different number of levels.

You can do

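exp_surprisal <- function(x, base = exp(1)) {
  counts <- table(x)            # number of observations at each level
  freq <- counts / sum(counts)  # empirical probability of each level
  # -p * log(p) per level; a zero frequency contributes 0 by convention.
  # The default base gives nats; pass base = 2 for bits.
  ifelse(freq == 0, 0, -freq * log(freq, base))
}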

And then

lapply(data, exp_surprisal)

gives

$buys
x
       no       yes 
0.3677212 0.2840353 

$credit
x
excellent      fair 
0.3631277 0.3197805 

$student
x
       no       yes 
0.3465736 0.3465736 

$income
x
     high       low    medium 
0.3579323 0.3579323 0.3631277 

$age
x
       23        25        26        27        29        33        35        36        37        41        42        44        45        48 
0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 0.1885041 

Note you can also define

entropy <- function(x) sum(exp_surprisal(x))

to get the entropy.

Then

lapply(data, entropy)

gives

$buys
[1] 0.6517566

$credit
[1] 0.6829081

$student
[1] 0.6931472

$income
[1] 1.078992

$age
[1] 2.639057
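The question also asks for information gain, which the functions above do not cover. A minimal sketch follows, assuming `buys` is the target variable and using base-2 logarithms; `entropy_bits` and `info_gain` are hypothetical helper names introduced for this example, and the data is repeated so the block runs on its own.

```r
# Sample data from the question; `buys` is assumed to be the target.
data <- data.frame(
  buys = c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"),
  credit = c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"),
  student = c("no", "no", "no", "no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes", "no"),
  income = c("high", "high", "high", "medium", "low", "low", "low", "medium", "low", "medium", "medium", "medium", "high", "medium")
)

# Shannon entropy (in bits) of a vector of class labels.
entropy_bits <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]  # drop empty classes so log2(0) never occurs
  -sum(p * log2(p))
}

# Information gain of `target` when the data is split on `x`:
# entropy of the target minus the weighted entropy within each level of x.
info_gain <- function(x, target) {
  w <- prop.table(table(x))             # share of rows at each level
  h <- tapply(target, x, entropy_bits)  # target entropy within each level
  entropy_bits(target) - sum(w[names(h)] * h)
}

predictors <- c("credit", "student", "income")
gain <- sapply(data[predictors], info_gain, target = data$buys)
gain

# Variable with the highest information gain
best <- names(which.max(gain))
best
```

On the sample data this selects `student`, with a gain of roughly 0.15 bits. The numeric `age` column is left out because every value is unique, so splitting on it directly would not be informative without binning.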
James_D

You have to modify your function to have two inputs, the variable you want and the level of the variable. Inside the function you then have to subset based on the level of the variable you want. I then use mapply to loop through the variable credit and each of its levels.

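entropy_tab <- function(x, y) {
  # Rows where variable x takes the level y
  keep <- data[, x] == y
  # Class proportions of the target within that level; the target is
  # assumed to be `buys`. The small constant avoids log2(0).
  tabfun2 <- prop.table(table(data[, x][keep], data[, "buys"][keep]) + 1e-6, margin = 1)
  sum(prop.table(table(data[, x][keep])) * rowSums(-tabfun2 * log2(tabfun2)))
}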


x <- mapply(entropy_tab, c("credit","credit"), unique(data$credit))

names(x) <- unique(data$credit)

#checks
entropy_tab("credit","excellent")
entropy_tab("credit","fair")
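The same per-level calculation can also be written without subsetting by hand. The sketch below assumes `buys` is the target and uses `tapply` to compute the entropy of the target within each level of `credit`, then weights each level by its share of rows to get its contribution to the overall conditional entropy. `level_entropy` is a hypothetical helper name, and only the two columns needed are repeated here.

```r
# Two columns of the sample data from the question; `buys` is assumed to be the target.
data <- data.frame(
  buys = c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"),
  credit = c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent")
)

# Entropy (in bits) of `target` within each level of `x`.
level_entropy <- function(x, target) {
  tapply(target, x, function(y) {
    p <- prop.table(table(y))
    p <- p[p > 0]  # drop empty classes so log2(0) never occurs
    -sum(p * log2(p))
  })
}

# Entropy of `buys` inside each credit level
per_level <- level_entropy(data$credit, data$buys)
per_level

# Contribution of each level: its entropy weighted by the share of rows at that level
weights <- prop.table(table(data$credit))
contribution <- weights[names(per_level)] * per_level
contribution

# The contributions add up to the conditional entropy of buys given credit
sum(contribution)
```

On the sample data, "excellent" has 3 "yes" and 3 "no" rows, so its entropy is 1 bit, while "fair" (6 "yes", 2 "no") comes out near 0.81 bits. Unlike the smoothed version above, this sketch drops empty classes instead of adding `1e-6`, so the two can differ slightly in the last decimals.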
Mike