
I want to calculate sum_logloss (see below) for each level of a factor (C1) using a data.table expression, but the result is not what I expect. Here is a small example showing what I get and why I expect a different sum_logloss.

LogLoss <- function(actual, predicted, eps=0.00001) {
  predicted <- pmin(pmax(predicted, eps), 1-eps)
  -1/length(actual)*(sum(actual*log(predicted)+(1-actual)*log(1-predicted)))
}

# Returns the total log loss: the sum of the per-element LogLoss values
TotalLogLossVector <- function(actual_vector, predicted_vector) {
  sum(mapply(LogLoss, actual_vector, predicted_vector))
}

library(data.table)
df <- data.frame(C1=c(1,1,2,2,1), C2=c(4,5,4,5,5), click=c(1,0,0,1,1))
df <- data.table(df)
df
   C1 C2 click
1:  1  4     1
2:  1  5     0
3:  2  4     0
4:  2  5     1
5:  1  5     1
df[,list(mean_CTR=mean(click),count=.N, sum_logloss=TotalLogLossVector(click,rep(mean_CTR,.N)) ),by=C1]
   C1  mean_CTR count sum_logloss
1:  1 0.6666667     3    3.663061
2:  2 0.5000000     2    1.928626

LogLoss(1,0.6666667)
[1] 0.4054651
LogLoss(0,0.6666667)
[1] 1.098612
TotalLogLossVector(c(1,0,1), c(0.6666667,0.6666667,0.6666667))
[1] 1.909543

So sum_logloss for C1=1 should be 2 * LogLoss(1,0.6666667) + 1 * LogLoss(0,0.6666667) = 1.909543, not 3.663061.
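The expected value above can be checked in base R, without data.table. This is a minimal sketch that reuses the LogLoss definition from the question; the names `clicks`, `p` and `per_row` are introduced here only for illustration.

```r
# LogLoss as defined in the question
LogLoss <- function(actual, predicted, eps = 0.00001) {
  predicted <- pmin(pmax(predicted, eps), 1 - eps)
  -1 / length(actual) * sum(actual * log(predicted) + (1 - actual) * log(1 - predicted))
}

clicks <- c(1, 0, 1)   # the click values for the rows where C1 == 1
p <- mean(clicks)      # group mean, 2/3

# One loss per row; mapply recycles the single value p across all rows
per_row <- mapply(LogLoss, clicks, p)

total_by_sum  <- sum(per_row)
total_by_hand <- 2 * LogLoss(1, p) + 1 * LogLoss(0, p)

# Both expressions describe the same quantity
all.equal(total_by_sum, total_by_hand)
```

Both totals come out at roughly 1.9095, which matches the TotalLogLossVector call shown above.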

Timothée HENRY
  • tucson, I see that you've not accepted a few answers under the data.table tag: [Q1](http://stackoverflow.com/q/24997556/559784), [Q2](http://stackoverflow.com/q/23760455/559784), [Q3](http://stackoverflow.com/q/23474094/559784), [Q4](http://stackoverflow.com/q/23471316/559784). I don't see any issues that you've followed up with. Any particular reason you've not accepted? Also you seem to have removed the accepted answer from akrun's here... Just wondering. – Arun Dec 30 '14 at 13:50
  • @Arun Yes, my bad, I often want to make double-sure the answer is correct and sometimes do not take the time to come back and validate. – Timothée HENRY Dec 30 '14 at 13:53
  • tucson, I see. That's nice, but it'd be great if you could follow up (if you've to) and close those questions if they indeed answer your Q. Thanks. – Arun Dec 30 '14 at 13:58

2 Answers

A small note: I'd recommend setDT() to convert data.frames to data.tables, especially if you're assigning the data.table back to the same variable.
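To illustrate the difference, here is a small sketch using the data from the question. The name `dt_copy` is made up for this example.

```r
library(data.table)

df <- data.frame(C1 = c(1, 1, 2, 2, 1), C2 = c(4, 5, 4, 5, 5), click = c(1, 0, 0, 1, 1))

# data.table(df) builds a new object, so the result has to be assigned somewhere
dt_copy <- data.table(df)

# setDT(df) converts df itself by reference; no assignment is needed
setDT(df)
is.data.table(df)
```

After the setDT() call, df is a data.table with the same columns as before.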


@akrun's answer is great, but it groups twice, which I find unnecessary. Here's how I'd do it:

setDT(df)[, {
    tmp = mean(click)   # per-group mean, computed once and reused below
    list(mean_CTR = tmp, count = .N,
         sum_logloss = TotalLogLossVector(click, tmp))
}, by=C1]
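A likely reason the call in the question misbehaves, stated as an assumption because the full session is not shown: names defined inside list() in j are not visible to later arguments of the same list(), so mean_CTR in rep(mean_CTR, .N) is looked up outside the data.table. The reported values are consistent with a leftover global mean_CTR of about 0.1765. The sketch below reproduces this under that assumption; the global value and the names `dt`, `wrong` and `right` are hypothetical.

```r
library(data.table)

# Helper functions as defined in the question
LogLoss <- function(actual, predicted, eps = 0.00001) {
  predicted <- pmin(pmax(predicted, eps), 1 - eps)
  -1 / length(actual) * sum(actual * log(predicted) + (1 - actual) * log(1 - predicted))
}
TotalLogLossVector <- function(actual_vector, predicted_vector) {
  sum(mapply(LogLoss, actual_vector, predicted_vector))
}

dt <- data.table(C1 = c(1, 1, 2, 2, 1), click = c(1, 0, 0, 1, 1))

# Hypothetical leftover global variable; an assumption, not shown in the question
mean_CTR <- 0.1765

# Here mean_CTR inside rep() resolves to the global, not to the per-group mean
wrong <- dt[, list(mean_CTR = mean(click),
                   sum_logloss = TotalLogLossVector(click, rep(mean_CTR, .N))),
            by = C1]

# Computing the mean first inside { } makes it available to the list() that follows
right <- dt[, {
  tmp <- mean(click)
  list(mean_CTR = tmp, sum_logloss = TotalLogLossVector(click, tmp))
}, by = C1]
```

Under this assumption, `wrong` gives sum_logloss values close to the 3.66 and 1.93 reported in the question, while `right` gives the expected per-group totals. Without such a global, the original call would instead stop with an "object not found" error.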
Arun

You could try

 # First add the per-group mean (V1) and count (V2) to df by reference, then summarise by C1
 df[, paste0('V', 1:2):=list(mean(click), .N), by=C1][,
    list(mean_CTR=V1[1L], count=V2[1L], sum_logloss=
              TotalLogLossVector(click, V1)), by=C1]

 #  C1  mean_CTR count sum_logloss
 #1:  1 0.6666667     3    1.909543
 #2:  2 0.5000000     2    1.386294
akrun