2

Trying to get weighted mean for a couple of categories want to use by(df$A,df$B,function(x) weighted.mean(x,df$C)) This doesn't work of course. Is there a way to do this using by() and weighted.mean()

 df= data.frame(A=c(1,4,56,4,3),B=c('hi','gb','hi','gb','yo'),C=c(5,2,4,1,3))

 by(df$A,df$B,function(x) weighted.mean(x,df$C)) #doesn't work

I have a bunch of work arounds but it would so simple if I could just use that format.

talat
  • 68,970
  • 21
  • 126
  • 157
Monal
  • 127
  • 1
  • 6
  • Why should it? `by` applies functions to subsets of a dataset, and you're trying to pass a weight vector that's the length of the original dataframe's number of observations. – Thomas Dec 22 '14 at 19:24
  • Why must you use `by`? This is easy with package plyr (or data.table or dplyr): `library(plyr); ddply(df, .(B), summarise, wm = weighted.mean(A, C))` – Roland Dec 22 '14 at 19:30
  • Another option is `lapply(split(df, df$B), function(x) weighted.mean(x$A, x$C))` since it seems you don't mind a list as output. – talat Dec 22 '14 at 19:47

3 Answers3

4

Here's a simple and efficient solution using data.table

library(data.table)
setDT(df)[, .(WM = weighted.mean(A, C)), B]
#     B       WM
# 1: hi 25.44444
# 2: gb  4.00000
# 3: yo  3.00000

Or using split and apply combination from base R

sapply(split(df, df$B), function(x) weighted.mean(x$A, x$C))
#      gb       hi       yo 
# 4.00000 25.44444  3.00000 

Or

library(dplyr)
df %>%
  group_by(B) %>%
  summarise(WM = weighted.mean(A, C))
# Source: local data frame [3 x 2]
# 
# B       WM
# 1 gb  4.00000
# 2 hi 25.44444
# 3 yo  3.00000
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • Thanks for all the options. I thought I had tried the split option but didn't work...Works. – Monal Dec 25 '14 at 19:02
  • In general, I would avoid using `by` for two reasons: It is VERY slow, second reason is- that you can't use its output as it and you need to find a way to convert it back to a data frame. – David Arenburg Dec 25 '14 at 21:32
  • Thanks for the advice. I was writing it to a text file and the layout of `by` looks better in that case. Is `sapply(split()` much faster? – Monal Dec 26 '14 at 22:47
  • Don't know, you will have to benchmark I guess if it is really important to you. – David Arenburg Dec 27 '14 at 18:47
  • I posted a related question: https://stackoverflow.com/questions/71183209 – PatrickT Feb 19 '22 at 08:11
4

You need to pass the weights along with the values to be averaged in by():

by(df[c("A","C")], df$B, function(x) weighted.mean(x$A, x$C))
# df$B: gb
# [1] 4
# ------------------------------------------------------------ 
# df$B: hi
# [1] 25.44444
# ------------------------------------------------------------ 
# df$B: yo
# [1] 3
konvas
  • 14,126
  • 2
  • 40
  • 46
  • This is exactly what I was after. I also wanted to use [] to specify column numbers. `by(df[,c(1,3)], df[,2], function(x) weighted.mean(x[,1], x[,3]))` In my real dataframe the original columns were in the 30's and but in the weighted.mean part they were 1,2. like `by(charlie[c(9,33,35)], charlie[,4], function(x) weighted.mean(x[,3]/x[,2], x[,1], na.rm = TRUE))`. – Monal Dec 25 '14 at 18:57
3

Or simply recreate the calculation used by weighted.mean():

by(df,df$B,function(df)with(df,sum(A*C)/sum(C)))

df$B: gb
[1] 4
------------------------------------------------------------ 
df$B: hi
[1] 25.44444
------------------------------------------------------------ 
df$B: yo
[1] 3
talat
  • 68,970
  • 21
  • 126
  • 157
Stephan Kolassa
  • 7,953
  • 2
  • 28
  • 48