I'm confused about the syntax of dplyr when attempting to compute a weighted mean.

I am following David's suggestion here. The syntax is very transparent and therefore attractive, but it does not work as I expected: below, the weighted mean is computed for the whole data set rather than for each group of the B variable.

head(df)
# A tibble: 4 × 3
      A     B     P
  <dbl> <dbl> <dbl>
1     1    10   0.4
2     2    10   0.6
3     1    20   0.2
4     2    20   0.8

library(dplyr)
df %>% group_by(B) %>%
    summarise(wm = weighted.mean(A, P))
# wm
# 1 1.7

I can achieve the desired result in several other ways. How can I use dplyr to replicate the calculations below?

# with a split/apply routine:
sapply(split(df, df$B), function(x) weighted.mean(x$A, x$P))
#  10  20 
# 1.6 1.8 

# with data.table
library(data.table)
setDT(df)[, .(wm = weighted.mean(A, P)), B]
#     B  wm
# 1: 10 1.6
# 2: 20 1.8

# with plyr:
library(plyr)
ddply(df, .(B), summarise, wm = weighted.mean(A, P))
#    B  wm
# 1 10 1.6
# 2 20 1.8

# with aggregate | the formula approach is mysterious
df$wm <- 1:nrow(df)
aggregate(wm ~ B, data=df, function(x) weighted.mean(df$A[x], df$P[x]))
#    B  wm
# 1 10 1.6
# 2 20 1.8
df$wm <- NULL  # no longer needed

Here is the toy data (a tibble, rather than a standard dataframe):

library(tidyverse)
df = structure(list(A = c(1, 2, 1, 2), B = c(10, 10, 20, 20), P = c(0.4, 0.6, 0.2, 0.8)), 
    row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

Here is one and another post about calculating a mean by group with dplyr, but I did not see how they could shed light on my problem.

PatrickT
  • Try `dplyr::summarise`; you probably have the `plyr` package loaded, so `plyr::summarise` is being used. – det Feb 19 '22 at 08:10
  • You're absolutely right. Do you want to post this as an answer? I mean it's a stupid mistake, but also kind of interesting... – PatrickT Feb 19 '22 at 08:13

1 Answer

This is a very common problem when the plyr package is loaded: if plyr is attached after dplyr, plyr::summarise masks dplyr::summarise, and plyr's version does not take the grouping into account. Call dplyr::summarise explicitly. It's the first thing to check whenever summarise returns unexpected results.
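
A minimal sketch of the explicit call, using the toy data from the question rebuilt as a plain data frame (an assumption made here so the snippet is self-contained). Because every verb is namespace-qualified, the result should not depend on whether plyr happens to be attached:

```r
library(dplyr)

# Toy data from the question, rebuilt as a plain data frame
df <- data.frame(A = c(1, 2, 1, 2),
                 B = c(10, 10, 20, 20),
                 P = c(0.4, 0.6, 0.2, 0.8))

# Qualify the verbs with dplyr:: so a masking package cannot intercept them
res <- df %>%
    dplyr::group_by(B) %>%
    dplyr::summarise(wm = weighted.mean(A, P))

res  # one row per value of B, with the weighted mean in column wm
```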

Another way is to detach the plyr package before using dplyr:

detach("package:plyr")
library("dplyr")
df %>% group_by(B) %>%
    summarise(wm = weighted.mean(A, P))
#       B    wm
#    <dbl> <dbl>
# 1    10   1.6
# 2    20   1.8
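
To check which package the bare name summarise currently resolves to, something along these lines may help. It is only a diagnostic sketch using base R and utils functions; the variable names owner and providers are made up for illustration:

```r
library(dplyr)

# Namespace of the function that the bare name `summarise` resolves to
owner <- environmentName(environment(summarise))
owner

# All attached packages that provide `summarise`, in search-path order;
# the first entry is the one that wins
providers <- find("summarise")
providers
```

If owner is anything other than "dplyr", another package is masking the function, and either the dplyr:: prefix or detaching that package should restore the grouped result.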

det
  • I have an `Rmd` document in which I load `tidyverse` and `dplyr`, but NOT `plyr`, and this problem comes up too! Not sure if `tidyverse` or some other package has a `summarise` function that takes precedence over `dplyr::summarise`. Be that as it may, the only workaround for me was to call `dplyr::summarise` explicitly. – PatrickT Feb 20 '22 at 07:24