How to compute weighted means of a vector within factor levels?

Question

I can get a simple mean of a given vector within factor levels, but when I try to take the next step and weight the observations, I can't get it to work. This works:

> tapply(exp.f,part.f.p.d,mean)
        1         2         3         4         5         6         7         8         9        10
0.8535996 1.1256058 0.6968142 1.4346451 0.8136110 1.2006801 1.6112160 1.9168835 1.5135006 3.0312460

But this doesn't:

> tapply(exp.f,part.f.p.d,weighted.mean,b.pct)
Error in weighted.mean.default(X[[1L]], ...) : 
  'x' and 'w' must have the same length

In the code below, I am trying to find the weighted mean of exp.f, within levels of the factor part.f.p.d, weighted by the observations within b.pct that are in each level.

b.exp <- tapply(exp.f,part.f.p.d,weighted.mean,b.pct)

Error in weighted.mean.default(X[[1L]], ...) : 
  'x' and 'w' must have the same length

I am thinking I must be supplying the incorrect syntax, as all 3 of these vectors are the same length:

> length(b.pct)
[1] 978
> length(exp.f)
[1] 978
> length(part.f.p.d)
[1] 978

What is the correct way to do this? Thank you in advance.

Hi jonw - exp.f is a numeric vector of expected stock returns, part.f.p.d is a factor with 10 levels, and b.pct holds the percentage for each stock in an index (the top 1000 stocks) — user297400, Feb 01 '11 at 18:47
See answers to http://stackoverflow.com/questions/3685492/r-speeding-up-group-by-operations. — Charles, Feb 01 '11 at 18:51

Joshua Ulrich · Accepted Answer · 2011-02-01T20:28:42.927


Now I do it like this (thanks to Gavin), assuming the three vectors are columns of a data frame called Data:

sapply(split(Data, Data$part.f.p.d), function(x) weighted.mean(x$exp.f, x$b.pct))

Others likely use ddply from the plyr package:

library(plyr)
ddply(Data, "part.f.p.d", function(x) weighted.mean(x$exp.f, x$b.pct))
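Both snippets assume a data frame named Data that the question never actually builds. As a rough sketch only, with made-up numbers rather than the asker's data, the split approach can be tried like this:

```r
# Hypothetical toy data; the column names follow the question,
# the values are invented for illustration.
Data <- data.frame(
  exp.f      = c(1, 2, 3, 4, 5, 6),
  b.pct      = c(0.1, 0.3, 0.2, 0.2, 0.5, 0.5),
  part.f.p.d = factor(c(1, 1, 2, 2, 3, 3))
)

# split() cuts the whole data frame by factor level, so the values and
# their weights stay aligned inside each piece.
wm <- sapply(split(Data, Data$part.f.p.d),
             function(x) weighted.mean(x$exp.f, x$b.pct))
print(wm)
```

The result is a numeric vector named by factor level, which is the same shape that tapply(exp.f, part.f.p.d, mean) returns.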

edited Feb 01 '11 at 20:28

answered Feb 01 '11 at 18:53


@Prasad: I knew the obligatory plyr solution would get some up-votes. ;-) – Joshua Ulrich Feb 01 '11 at 19:05
thank you - I can see that I need to invest some time learning what plyr is all about. cheers. – user297400 Feb 01 '11 at 19:27

@Joshua the `do.call` is a bit of extra overkill here. `sapply(split(Data, Data$part.f.p.d), function(x) weighted.mean(x$exp.f,x$b.pct))` would be sufficient to return a vector of weighted means. The simplicity of your `split` approach (+1) is hidden by the `rbind`+`do.call` wrapping. – Gavin Simpson Feb 01 '11 at 19:39

Why the plyr love-in? ;-) I agree it is a very nice package, but such simple problems as that posed in the Q can be handled very nicely via basic R functionality without needing to learn a new package. – Gavin Simpson Feb 01 '11 at 19:43
@Gavin: The `do.call(rbind, ...)` stuff is just habit that works on more general problems. You're right that `sapply` is much nicer in this case. – Joshua Ulrich Feb 01 '11 at 20:05
@Gavin the idea behind plyr is that by being more consistent you don't need to remember the bewildering array of aggregation functions in base r, let alone their inconsistent parameters. – hadley Feb 02 '11 at 01:31
@hadley I guess having grown up in a world before plyr the various base R options and arguments are burned into my synapses - probably as a result of having to remember all the various functions and inconsistent arguments :-) – Gavin Simpson Feb 02 '11 at 09:29

J. Win. · Answer 2 · 2011-02-01T19:01:15.837

I've recreated the error with some dummy data. I'm assuming that part.f.p.d is some kind of factor that you're using to separate the other vectors.

b.pct <- sample(1:100, 10) / 100
exp.f <- sample(1:1000, 10)
part.f.p.d <- factor(rep(letters[1:5], 2))

tapply(exp.f, part.f.p.d, mean) # this works
tapply(exp.f, part.f.p.d, weighted.mean, w = b.pct) # this doesn't

A call to traceback() helps to uncover the problem. The second call fails because tapply() uses the INDEX argument (i.e. part.f.p.d) to split only the X argument (i.e. exp.f) into smaller vectors. Each piece is then passed to weighted.mean() together with the full, unsplit w argument (i.e. b.pct), so the lengths of x and w no longer match.

EDIT: This should do what you want.

sapply(levels(part.f.p.d), 
       function(whichpart) weighted.mean(x = exp.f[part.f.p.d == whichpart], 
                                         w = b.pct[part.f.p.d == whichpart]))
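If you would rather keep tapply(), one possible workaround (a sketch on made-up data, not something from the original answer) is to split the positions instead of the values, and let the function look up both vectors by position:

```r
# Made-up data in the same layout as the question's vectors.
exp.f      <- c(10, 20, 30, 40)
b.pct      <- c(0.25, 0.75, 0.5, 0.5)
part.f.p.d <- factor(c("a", "a", "b", "b"))

# Split the positions 1..n rather than the values, then use each group's
# positions to subset both the values and the weights.
res <- tapply(seq_along(exp.f), part.f.p.d,
              function(i) weighted.mean(exp.f[i], b.pct[i]))
print(res)
```

Because only the index vector goes through tapply(), the weights never need to be passed via the ... argument at all.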

thank you - is there some tweak that would make this work to calculate a weighted.mean that you know of? — user297400, Feb 01 '11 at 18:47

rbtgde · Answer 3 · 2011-02-01T18:53:46.960

Your problem is that tapply does not "split" the extra arguments supplied (through its ... arguments) to the function, as it does for the main argument X. See the 'Note' on the help page for tapply (?tapply).

Optional arguments to FUN supplied by the ... argument are not divided into cells. It is therefore inappropriate for FUN to expect additional arguments with the same length as X.

Here is a hacky solution.

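# Dummy data: values, a grouping factor, and non-negative weights
exp.f <- rnorm(10)
part.f.p.d <- factor(sample(1:5, size = 10, replace = TRUE))
b.pct <- runif(10)
# Split values and weights by the same factor so the pieces line up
a <- split(exp.f, part.f.p.d)
b <- split(b.pct, part.f.p.d)
# Pair up the i-th piece of each list
lapply(seq_along(a), function(i){
  weighted.mean(a[[i]], b[[i]])
})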
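The same pairing of two split lists can probably be written more compactly with mapply(), which walks both lists in parallel and keeps the level names. A minimal sketch on invented data:

```r
# Made-up data; the variable names follow the question.
exp.f      <- c(2, 4, 6, 8)
b.pct      <- c(1, 3, 1, 1)
part.f.p.d <- factor(c("x", "x", "y", "y"))

# mapply() calls weighted.mean(x, w) once per level, taking x from the
# first split list and w from the second.
res <- mapply(weighted.mean,
              split(exp.f, part.f.p.d),
              split(b.pct, part.f.p.d))
print(res)
```

Unlike the lapply(seq_along(a), ...) version, the result here is a named vector, so each weighted mean stays labelled with its factor level.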
