Weighted mean with by function

Question

Trying to get weighted mean for a couple of categories want to use by(df$A,df$B,function(x) weighted.mean(x,df$C)) This doesn't work of course. Is there a way to do this using by() and weighted.mean()

 df= data.frame(A=c(1,4,56,4,3),B=c('hi','gb','hi','gb','yo'),C=c(5,2,4,1,3))

 by(df$A,df$B,function(x) weighted.mean(x,df$C)) #doesn't work

I have a bunch of work arounds but it would so simple if I could just use that format.

Why should it? `by` applies functions to subsets of a dataset, and you're trying to pass a weight vector that's the length of the original dataframe's number of observations. — Thomas, Dec 22 '14 at 19:24
Why must you use `by`? This is easy with package plyr (or data.table or dplyr): `library(plyr); ddply(df, .(B), summarise, wm = weighted.mean(A, C))` — Roland, Dec 22 '14 at 19:30
Another option is `lapply(split(df, df$B), function(x) weighted.mean(x$A, x$C))` since it seems you don't mind a list as output. — talat, Dec 22 '14 at 19:47

David Arenburg · Answer 1 · 2014-12-22T19:39:54.417

4

Here's a simple and efficient solution using data.table

library(data.table)
setDT(df)[, .(WM = weighted.mean(A, C)), B]
#     B       WM
# 1: hi 25.44444
# 2: gb  4.00000
# 3: yo  3.00000

Or using split and apply combination from base R

sapply(split(df, df$B), function(x) weighted.mean(x$A, x$C))
#      gb       hi       yo 
# 4.00000 25.44444  3.00000

Or

library(dplyr)
df %>%
  group_by(B) %>%
  summarise(WM = weighted.mean(A, C))
# Source: local data frame [3 x 2]
# 
# B       WM
# 1 gb  4.00000
# 2 hi 25.44444
# 3 yo  3.00000

edited Dec 22 '14 at 19:39

answered Dec 22 '14 at 19:33

David Arenburg

91,361
17
137
196

Thanks for all the options. I thought I had tried the split option but didn't work...Works. – Monal Dec 25 '14 at 19:02
In general, I would avoid using `by` for two reasons: It is VERY slow, second reason is- that you can't use its output as it and you need to find a way to convert it back to a data frame. – David Arenburg Dec 25 '14 at 21:32
Thanks for the advice. I was writing it to a text file and the layout of `by` looks better in that case. Is `sapply(split()` much faster? – Monal Dec 26 '14 at 22:47
Don't know, you will have to benchmark I guess if it is really important to you. – David Arenburg Dec 27 '14 at 18:47
I posted a related question: https://stackoverflow.com/questions/71183209 – PatrickT Feb 19 '22 at 08:11

score 4 · Accepted Answer · answered Dec 22 '14 at 19:37

4

You need to pass the weights along with the values to be averaged in by():

by(df[c("A","C")], df$B, function(x) weighted.mean(x$A, x$C))
# df$B: gb
# [1] 4
# ------------------------------------------------------------ 
# df$B: hi
# [1] 25.44444
# ------------------------------------------------------------ 
# df$B: yo
# [1] 3

answered Dec 22 '14 at 19:37

konvas

14,126
2
40
46

This is exactly what I was after. I also wanted to use [] to specify column numbers. `by(df[,c(1,3)], df[,2], function(x) weighted.mean(x[,1], x[,3]))` In my real dataframe the original columns were in the 30's and but in the weighted.mean part they were 1,2. like `by(charlie[c(9,33,35)], charlie[,4], function(x) weighted.mean(x[,3]/x[,2], x[,1], na.rm = TRUE))`. – Monal Dec 25 '14 at 18:57

score 3 · Answer 3 · edited Dec 22 '14 at 19:48

3

Or simply recreate the calculation used by weighted.mean():

by(df,df$B,function(df)with(df,sum(A*C)/sum(C)))

df$B: gb
[1] 4
------------------------------------------------------------ 
df$B: hi
[1] 25.44444
------------------------------------------------------------ 
df$B: yo
[1] 3

edited Dec 22 '14 at 19:48

talat

68,970
21
126
157

answered Dec 22 '14 at 19:43

Stephan Kolassa

7,953
2
28
48

Weighted mean with by function

3 Answers3

Linked

Related