7

I am looking for a solution to compute weighted sum of some variables by groups with data.table. I hope the example is clear enough.

require(data.table)

dt <- data.table(matrix(1:200, nrow = 10))
dt[, gr := c(rep(1,5), rep(2,5))]
dt[, w := 2]

# Error: object 'w' not found
dt[, lapply(.SD, function(x) sum(x * w)),
   .SDcols = paste0("V", 1:4)]

# Error: object 'w' not found
dt[, lapply(.SD * w, sum),
   .SDcols = paste0("V", 1:4)]

# This works with out groups
dt[, lapply(.SD, function(x) sum(x * dt$w)),
   .SDcols = paste0("V", 1:4)]

# It does not work by groups
dt[, lapply(.SD, function(x) sum(x * dt$w)),
   .SDcols = paste0("V", 1:4), keyby = gr]

# The result to be expected
dt[, list(V1 = sum(V1 * w),
          V2 = sum(V2 * w),
          V3 = sum(V3 * w),
          V4 = sum(V4 * w)), keyby = gr]

### from Aruns answer
dt[, lapply(.SD[, paste0("V", 1:4), with = F],
            function(x) sum(x*w)), by=gr]
djhurio
  • 5,437
  • 4
  • 27
  • 48

1 Answers1

6

Final attempt (copying Roland's answer :))

Copying @Roland's excellent answer:

print(dt[, lapply(.SD, function(x, w) sum(x*w), w=w), by=gr][, w := NULL])

still not the most efficient one: (second attempt)

Following @Roland's comment, it's indeed faster to do the operation on all columns and then just remove the unwanted ones (as long as the operation itself is not time consuming, which is the case here).

dt[, {lapply(.SD, function(x) sum(x*w))}, by=gr][, w := NULL][]

For some reason, w seems to be not found when I don't use {}.. No idea why though.


old (inefficient) answer:

(Subsetting can be costly if there are too many groups)

You can do this without using .SDcols and then removing it while providing it to lapply as follows:

dt[, lapply(.SD[, -1, with=FALSE], function(x) sum(x*w)), by=gr]
#    gr V1  V2  V3  V4
# 1:  1 20 120 220 320
# 2:  2 70 170 270 370

.SDcols makes .SD without the w column. So, it's not possible to multiply with w as it doesn't exist within the scope of .SD environment then.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • 2
    I suspect it might be faster (due to data.table optimization) to just lapply over all columns and simply remove the `w*w` column afterwards. – Roland Jul 19 '13 at 10:49
  • 1
    Roland, seems so. for some reason `w` is not found unless I use `{}`. I'll make an edit. Maybe you follow why... – Arun Jul 19 '13 at 10:58
  • 1
    `print(dt[, lapply(.SD, function(x, w) sum(x*w), w=w), by=gr][, w := NULL])` – Roland Jul 19 '13 at 11:04
  • that's awesome indeed. much faster as well. hadn't thought about the `lapply` optional argument to FUN. – Arun Jul 19 '13 at 11:08
  • Arun, it seems I will be using your ineffective solution. It is because I have to do the weighted sum only for small subset of all variables. I have updated the question accordingly. – djhurio Jul 19 '13 at 11:08
  • 1
    @djhurio, no you can still use it by doing: `dt[, lapply(.SD, function(x, w) sum(x*w), w=w), by=gr, .SDcols=c("w", "V1", "V2", "V3", "V4")][, w := NULL]` – Arun Jul 19 '13 at 11:11
  • 1
    @Arun, thanks! The data set in my case is not so big. So I do not see dramatic time lag with the ineffective solution. I will use it because the syntax is shorter and more readable. – djhurio Jul 19 '13 at 11:22