4

For some years now I've been using the Hmisc package and base R to compute weighted statistical summaries. Typically, I use a double weight with one being a spatial influence weight and the other a data support value such as length, volume, physical density and so on. Using the 'mtcars' dataset as an example where mpg is the variable of interest and the double weight contrived from car 'wt' and 'hp', the Hmisc + base R workflow is typically like the one below.

require(Hmisc)

mtcars$Wt2 <- mtcars$wt * mtcars$hp               # double weight
mtcars$Acc <- mtcars$Wt2 * mtcars$mpg             # accumulation

min(mtcars$mpg)                                   # min
sqrt(wtd.var(mtcars$mpg, mtcars$mpg))             # wtd sd
wtd.quantile(mtcars$mpg, mtcars$mpg,0.05)         # wtd 5th 
wtd.quantile(mtcars$mpg, mtcars$mpg,0.50)         # wtd median
wtd.quantile(mtcars$mpg, mtcars$mpg,0.95)         # wtd 95th
max(mtcar$mpg)                                    # max

Using a loop it is then possible to filter and compute these weighted statistics for each zone of interest in a larger data frame. However, as I'm attempting to learn how to use dplyr I wondering how to compute these weighted statistics. A weighted mean option is available but the others will require a bit more work. Below is the code where I've worked out how to compute the weighted mean from first principles(and checked this against the dplyr inbuilt function). However, I'm stumped as tp how to proceed to compute the weighted SD and quantiles by group using dplyr as I somehow need to get the squared weighted mean differences (for each group) into the chain of pipes.

mtcars %>% 
  mutate(Car = row.names(mtcars)) %>%  # variable for car name
  mutate(Wt2 = wt * hp) %>%            # double weight
  mutate(Acc = Wt2 * mpg) %>%          # weighted consumption
  group_by(Car) %>%                    # group by car type
  summarise(n = n(),
            SmWt2 = sum(Wt2),                    # Sum of double weight
            SmAcc = sum(Acc),                    # Sum of accumulations
            WtMn = SmAcc/SmWt2,                  # Weighted mean
            WtMnChk = weighted.mean(mpg, Wt2)    # Check weighted mean
            )
Markm0705
  • 1,340
  • 1
  • 13
  • 31
  • Does [this question](https://stackoverflow.com/questions/40369832/assign-intermediate-output-to-temp-variable-as-part-of-dplyr-pipeline) point you in the right direction? – Kent Orr Jul 24 '21 at 04:16
  • 1
    In the first part `Acc` is Wt2 * Wt2, in the 2nd it's Wt2 * mpg, which is intended? – Jon Spring Jul 24 '21 at 04:59
  • 1
    I'm unclear what you mean by weighted average when the groups are all single cars. I would think you'd want to get weighted average among a group, say all the 4 cylinder cars... – Jon Spring Jul 24 '21 at 05:05
  • Jon - thanks for spotting that error - I have corrected the example to mpg. Your comment about the weighted average for single cars is also sensible - I should have made an example for as you suggest as that would make more sense. However, the crux of the question is about weighted calculations with plyr, which is currently limited to only the weighted mean and weighted residuals. – Markm0705 Jul 24 '21 at 06:31

1 Answers1

3

I'm not sure I understand exactly the approach you're working on, but here's an example of finding the weighted average and weighted standard deviation by gear, using wt as the weighting:

library(dplyr)
datasets::mtcars %>% 
  group_by(gear) %>%
  summarize(n = n(),
            mpg_weighted_by_weight = sum(mpg*wt) / sum(wt),
            mpg_weighted_by_weight_check = weighted.mean(mpg, wt),
            
            mpg_sd = sqrt(sum(wt * ((mpg - mpg_weighted_by_weight)^2))/(sum(wt)-1)),
            mpg_sd_check = sqrt(Hmisc::wtd.var(mpg, wt)))


# A tibble: 3 x 6
   gear     n mpg_weighted_by_weight mpg_weighted_by_weight_check mpg_sd mpg_sd_check
* <dbl> <int>                  <dbl>                        <dbl>  <dbl>        <dbl>
1     3    15                   15.6                         15.6   3.32         3.32
2     4    12                   23.6                         23.6   4.81         4.81
3     5     5                   19.7                         19.7   5.63         5.63

I wasn't familiar with the formula for weighted standard deviation, but rather cheated and relied on the formula from Hmisc::wtd.var. If you control-click on the formula name in RStudio, it shows the underlying code of the function. Most of it is error handling until the bottom:

#Hmisc::wtd.var
function (x, weights = NULL, normwt = FALSE, na.rm = TRUE, method = c("unbiased", 
  "ML")) 
{
  # ...  skipping error handling
  sw <- sum(weights)
  # ...
  xbar <- sum(weights * x)/sw
  sum(weights * ((x - xbar)^2))/(sw - 1)
}
Jon Spring
  • 55,165
  • 4
  • 35
  • 53
  • Nice solution - I did not realize I could call Hsmic functions inside dplyr - perhaps that's the best approach to avoid working from first principles – Markm0705 Jul 24 '21 at 06:37