12

Issues regarding the command by and weighted.mean already exist but none was able to help solving my problem. I am new to R and am more used to data mining language than programming.

I have a data frame with for each individual (observation/row) the income, education level and sample weight. I want to calculate the weighted mean of income by education level, and I want the result to be associated to each individual in a new column of my original data frame, like this:

obs income education weight incomegroup
1.   1000      A       10    --> display weighted mean of income for education level A
2.   2000      B        1    --> display weighted mean of income for education level B
3.   1500      B        5    --> display weighted mean of income for education level B
4.   2000      A        2    --> display weighted mean of income for education level A

I tried:

data$incomegroup=by(data$education, function(x) weighted.mean(data$income, data$weight))    

It does not work. The weighted mean is calculated somehow and appears in column "incomegroup" but for the whole set instead of by group or for one group only, I don't know. I read things regarding packages plyr or aggregate but it does not seem to do what I am interested in.

The ave{stats} command gives exactly what I am looking for but only for simple mean:

data$incomegroup=ave(data$income,data$education,FUN = mean)

It cannot be used with weights.

Thanking you in advance for your help!

Elixterra
  • 281
  • 1
  • 3
  • 11

3 Answers3

13

If we use mutate, then we can avoid the left_join

library(dplyr)
df %>%
   group_by(education) %>% 
   mutate(weighted_income = weighted.mean(income, weight))
#    obs income education weight weighted_income
#  <int>  <int>    <fctr>  <int>           <dbl>
#1     1   1000         A     10        1166.667
#2     2   2000         B      1        1583.333
#3     3   1500         B      5        1583.333
#4     4   2000         A      2        1166.667
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    The 3 proposed solutions work. This solution (using library `dplyr` together with `mutate`) is however the most efficient. Using library `data.table` works perfectly fine as well but require to install the package and gives a warning regarding the R version: package compiled with R 3.2.5 while I'm using R 3.2.3. Thank you all for your efficiency! – Elixterra Jul 22 '16 at 08:49
  • @Elixterra Even though the code of akrun in `dplyr` appears more readible, the `data.table` approach is far more efficient. I compared the two solutions on a 14kk `data.frame` with 696k unique ID. The result is: `dplyr = user: 49.78 system: 0.01 elapsed: 50.05` `data.table = user: 7.15 system: 0.00 elapsed: 7.17` – Seymour Dec 07 '17 at 14:54
9

Try using the dplyr package as follows:

df <- read.table(text = 'obs income education weight   
                          1   1000      A       10     
                          2   2000      B        1     
                          3   1500      B        5     
                          4   2000      A        2', 
                 header = TRUE)     

library(dplyr)

df_summary <- 
  df %>% 
  group_by(education) %>% 
  summarise(weighted_income = weighted.mean(income, weight))

df_summary
# education weighted_income
#     A        1166.667
#     B        1583.333

df_final <- left_join(df, df_summary, by = 'education')

df_final
# obs income education weight weighted_income
#  1   1000         A     10        1166.667
#  2   2000         B      1        1583.333
#  3   1500         B      5        1583.333
#  4   2000         A      2        1166.667
Alex Ioannides
  • 1,204
  • 9
  • 10
5

There is a function weighted.mean in base R. Unfortunately, it does not work easily with ave. One solution is to use data.table

library(data.table)
setDT(data)
data[, incomeGroup := weighted.mean(income, weight), by=education]
data
   income education weight incomeGroup
1:   1000         A     10    1166.667
2:   2000         B      1    1583.333
3:   1500         B      5    1583.333
4:   2000         A      2    1166.667

A bizarre method that does work with ave is

ave(df[c("income", "weight")], df$education,
    FUN=function(x) weighted.mean(x$income, x$weight))[[1]]
[1] 1166.667 1583.333 1583.333 1166.667

You feed the subset data.frame to the function and then group by your grouping variable. The FUN argument creates a function that takes a data.frame and applies weighted.mean to the result. As the final output is a data.frame, the [[1]] returns a vector with the desired result.

Note that this is just a proof that this is possible -- I wouldn't recommend this method, the data.table technique is much cleaner and will be much faster on data sets larger than 1000 observations.

lmo
  • 37,904
  • 9
  • 56
  • 69