Calculating 95% confidence intervals for a weighted median over grouped data in dplyr

Question

I have a dataset with several groups, where I want to calculate a median value for each group using dplyr. The data are weighted, and the weights need to be taken into account in calculating the median. I found the weighted.median function from spatstat which seems to work fine. Consider the following simplified example:

library(spatstat)
library(dplyr)

tst <- data.frame(group = rep(c(1:5), each = 100))
tst$val = runif(500) * tst$group
tst$wt = runif(500) * tst$val

tst %>%
  group_by(group) %>%
  summarise(weighted.median(val, wt))

# A tibble: 5 × 2
  group `weighted.median(val, wt)`
  <int>                      <dbl>
1     1                      0.752
2     2                      1.36 
3     3                      1.99 
4     4                      2.86 
5     5                      3.45

However, I would also like to add 95% confidence intervals to these values, and this has me stumped. Things I've considered:

Spatstat also has a weighted.var function but there's no documentation, and it's not even clear to me whether this is variance around the median or mean.
This rcompanion post suggests various methods for calculating CIs around medians, but as far as I can tell none of them handle weights.
This blog post suggests a function for calculating CIs and a median for weighted data, and is the closest I can find to what I need. However, it doesn't work with my dplyr groupings. I suppose I could write a loop to do this one group at a time and build the output data frame, but that seems cumbersome. I'm also not totally sure I understand the function in the post and am slightly suspicious of its results: for instance, testing this out I get wider estimates for alpha = 0.1 than for alpha = 0.05, which seems backwards to me. Edit to add: upon further investigation, I think this function works as intended if I use alpha = 0.95 for 95% CIs, rather than alpha = 0.05 (at least, this returns values that feel intuitively about right). I can also make it work with dplyr by editing it to return just a single moe value rather than a pair of high/low estimates. So this may be a good option, but I'm also considering others.

Is there an existing function in some library somewhere that can do what I want, or an otherwise straightforward way to implement this?

Your best bet is to bootstrap the confidence intervals or use the `survey` package quantile estimator for the 0.5 quantile, passing your weights with the appropriate weight argument. Beware that the sample median is a biased-but-consistent estimator of the population median: https://stats.stackexchange.com/questions/36134/an-unbiased-estimate-of-the-median – socialscientist, Jul 26 '22 at 22:57

score 3 · Answer 1 · answered Jul 27 '22 at 13:22

There are several approaches.

You could use the asymptotic formula for the standard error of the sample median. The sample median is asymptotically normal with standard error 1/(2 f(m) sqrt(n)), equivalently 1/sqrt(4 n f(m)^2), where n is the number of observations, m is the true median, and f(x) is the probability density of the (weighted) random variable. You could estimate the probability density using the base R function density.default with the weights argument, which expects weights that sum to 1. If x is the vector of observed values and w the corresponding vector of weights, then

med <- weighted.median(x, w)
# density() expects weights that sum to 1
f <- density(x, weights = w / sum(w))
# interpolate the density estimate at the weighted median
fmed <- approx(f$x, f$y, xout = med)$y
samplesize <- length(x)
# asymptotic standard error: 1 / (2 * f(m) * sqrt(n))
se <- 1 / (2 * fmed * sqrt(samplesize))
ci <- med + c(-1, 1) * 1.96 * se

This relies on several asymptotic approximations so it may be inaccurate. Also the sample size depends on the interpretation of the weights. In some cases the sample size could be equal to sum(w).

If there is very little data in each group, you could use the even simpler normal reference approximation,

med <- weighted.median(x, w)
v <- weighted.var(x, w)
# for normal data, the median's standard error is sqrt(pi/2) * sd / sqrt(n)
sdm <- sqrt(pi/2) * sqrt(v)
samplesize <- length(x)
se <- sdm/sqrt(samplesize)
ci <- med + c(-1,1) * 1.96 * se

Alternatively, you could use bootstrapping: generate random resamples of the input data (by choosing random resamples of the indices 1, 2, ..., n), extract the corresponding weighted observations (x_i, w_i), compute the weighted median of each resampled dataset, and construct the 95% confidence interval from the resulting medians (for example, from their 2.5% and 97.5% quantiles). This approach implicitly assumes the sample size is equal to n.
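The bootstrap steps can be sketched in base R as follows. This is a minimal illustration, not a definitive implementation: `wmedian` and `boot_wmedian_ci` are hypothetical helper names, `wmedian` is a simple hand-rolled weighted median (it may differ slightly from spatstat's `weighted.median`, which interpolates by default), and the number of resamples and the percentile interval are assumptions made for the example.

```r
# Hypothetical helper: smallest value at which the cumulative weight
# reaches half of the total weight (no interpolation).
wmedian <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  x[which(cumsum(w) >= sum(w) / 2)[1]]
}

# Hypothetical helper: percentile bootstrap CI for the weighted median.
# Indices are resampled with replacement so each value stays paired
# with its weight.
boot_wmedian_ci <- function(x, w, n_boot = 2000, level = 0.95) {
  n <- length(x)
  meds <- replicate(n_boot, {
    idx <- sample.int(n, n, replace = TRUE)
    wmedian(x[idx], w[idx])
  })
  alpha <- 1 - level
  ci <- quantile(meds, c(alpha / 2, 1 - alpha / 2), names = FALSE)
  data.frame(median = wmedian(x, w), lower = ci[1], upper = ci[2])
}

# Example data in the same shape as the question's
set.seed(1)
tst <- data.frame(group = rep(1:5, each = 100))
tst$val <- runif(500) * tst$group
tst$wt <- runif(500) * tst$val

# One row per group, using base R split/lapply
res <- do.call(rbind, lapply(split(tst, tst$group), function(d) {
  cbind(group = d$group[1], boot_wmedian_ci(d$val, d$wt))
}))
print(res)
```

Because `boot_wmedian_ci` returns a one-row data frame, it should also work directly inside a grouped summary with dplyr 1.0.0 or later, e.g. `tst %>% group_by(group) %>% summarise(boot_wmedian_ci(val, wt))`, since `summarise()` unpacks an unnamed data frame result into columns.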

Your first approach gives suspiciously narrow confidence intervals for data with higher numeric values. For instance, if I set `x = c(1:100)` and `w = c(1:100) / sum(c(1:100))`, the first approach gives an answer of 70.5 +/- 0.83, while the second gives 70.5 +/- 5.86. These are already fairly different answers, but if I multiply x by 100, I get 7050 +/- 8.3 from the first approach and 7050 +/- 586 from the second. Intuitively, the first approach is giving too low an MOE here, and it doesn't make sense that it only goes up 10X when the data are uniformly multiplied by 100. Any insight? – Joe, Jul 27 '22 at 16:49
