9

This question follows another one on group weighted means: I would like to create weighted within-group averages using data.table. The difference with the initial question is that the names of the variables to be average are specified in a string vector.

The data:

df <- read.table(text= "
          region    state  county  weights y1980  y1990  y2000
             1        1       1       10     100    200     50
             1        1       2        5      50    100    200
             1        1       3      120    1000    500    250
             1        1       4        2      25    100    400
             1        1       4       15     125    150    200
             2        2       1        1      10     50    150
             2        2       2       10      10     10    200
             2        2       2       40      40    100     30
             2        2       3       20     100    100     10
", header=TRUE, na.strings=NA)

Using Roland's suggested answer from aforementioned question:

library(data.table)
dt <- as.data.table(df)
dt2 <- dt[,lapply(.SD,weighted.mean,w=weights),by=list(region,state,county)]

I have a vector with strings to determine dynamically columns for which I want the within-group weighted average.

colsToKeep = c("y1980","y1990")

But I do not know how to pass it as an argument for the data.table magic.

I tried

 dt[,lapply(
      as.list(colsToKeep),weighted.mean,w=weights),
      by=list(region,state,county)]` 

but I then get:

Error in x * w : non-numeric argument to binary operator

Not sure how to achieve what I want.

Bonus question: I'd like original columns names to be kept, instead of getting V1 and V2.

NB I use version 1.9.3 of the data.table package.

Jilber Urbina
  • 58,147
  • 10
  • 114
  • 138
Peutch
  • 763
  • 2
  • 12
  • 29

2 Answers2

11

Normally, you should be able to do:

dt2 <- dt[,lapply(.SD,weighted.mean,w=weights), 
          by = list(region,state,county), .SDcols = colsToKeep]

i.e., just by providing just those columns to .SDcols. But at the moment, this won't work due to a bug, in that weights column won't be available because it's not specified in .SDcols.

Until it's fixed, we can accomplish this as follows:

dt2 <- dt[, lapply(mget(colsToKeep), weighted.mean, w = weights), 
            by = list(region, state, county)]
#    region state county     y1980    y1990
# 1:      1     1      1  100.0000 200.0000
# 2:      1     1      2   50.0000 100.0000
# 3:      1     1      3 1000.0000 500.0000
# 4:      1     1      4  113.2353 144.1176
# 5:      2     2      1   10.0000  50.0000
# 6:      2     2      2   34.0000  82.0000
# 7:      2     2      3  100.0000 100.0000
Arun
  • 116,683
  • 26
  • 284
  • 387
  • Bug still present or is the first approach no longer recommended? On 16 December 2015, I get this: ``Error in as.double(w) : cannot coerce type 'closure' to vector of type 'double'`` – PatrickT Dec 16 '15 at 10:49
  • 1
    bug isn't fixed yet, sorry :-(. You could do: `dt[, lapply(mget(colsToKeep), weighted.mean, w=weights), by=.(region,state,country)]. Your error seems to indicate that you're using `as.double` with a function as input (which is unrelated). – Arun Dec 16 '15 at 11:03
  • thanks. So essentially your suggestion is to use ``mget()`` rather than ``as.list(.SD)[]`` right? (I know that the dot you use after ``by=`` is a shorthand for ``list``, so that bit of the code is the same as your workaround above) (About the error message, I think I merely copy-pasted the OP's data, but didn't go via a data.frame.) – PatrickT Dec 16 '15 at 11:23
  • 1
    Yes, that's right. I'll edit it in to replace `as.list()`. I must have written it at the time since `mget()` dint work back then (which was another bug, but that we managed to fix). – Arun Dec 16 '15 at 11:25
1

I don't know data.table but have you considered using dplyr? I think that it is almost as fast as data.table

library(dplyr)
df %>% 
  group_by(region, state, county) %>% 
  summarise(mean_80 = weighted.mean(y1980, weights), 
            mean_90 = weighted.mean(y1990, weights))
Source: local data frame [7 x 5]
Groups: region, state

  region state county   mean_80  mean_90
1      1     1      1  100.0000 200.0000
2      1     1      2   50.0000 100.0000
3      1     1      3 1000.0000 500.0000
4      1     1      4  113.2353 144.1176
5      2     2      1   10.0000  50.0000
6      2     2      2   34.0000  82.0000
7      2     2      3  100.0000 100.0000
kferris10
  • 129
  • 1
  • 3
  • Thank you for your help but I need to use data.table, and also your answer does not address the new constraint of my question, which is that columns have to specified dynamically by a vector. – Peutch Sep 24 '14 at 14:37
  • My bad, I should have read your post more carefully. If you decide to switch to dplyr, [here's](http://stackoverflow.com/questions/26003574/r-dplyr-mutate-use-dynamic-variable-names) an example that might be helpful – kferris10 Sep 24 '14 at 14:43