I would like to calculate for each of several columns the grouped weighted means, but to do this 'in place', by which I mean ending up with the same number of rows as I start, not a summary. I.e., if there are two rows that belong to the same group, they would each have the same weighted averages presented as repeats, rather than these being collapsed into one row that represents them both.
I have this version that works in base R, but is very slow (and at some size seems to crash without producing a result, I think due to running out of memory) for my actual large-ish dataset:
# Some dummy data
test_w <- c(0.5, 1, 1.5, 0.5, 1, 1.5)
test_g <- list(g1 = c("Yes", "Yes", "Yes", "No", "No", "No"),
g2 = c("Yes", "Yes", "No", "No", "No", "Yes"))
test_x <- matrix(c(1, 2, 3, 4, 5, 6,
10, 9, 8, 7, 6, 5),
nrow = 6,
dimnames = list(rows = c(),
cols = c("x1", "x2")))
# Gives desired answers:
temp_means_by_groups_1 <- apply(
test_x, 2,
FUN = function(x) return (
ave(test_w * x, test_g, FUN = sum) /
ave(test_w, test_g, FUN = sum)))
My actual dataset has about 40 'x' columns and about 10,000 rows.
I see from this SO answer that weighted.mean()
doesn't play nicely with ave()
: https://stackoverflow.com/a/38509589/4957167
So I have tried to do something similar using dplyr / tidyverse:
# A data frame version of the dummy data
test_data <- data.frame(x1 = c(1, 2, 3, 4, 5, 6),
x2 = c(10, 9, 8, 7, 6, 5),
g1 = c("Yes", "Yes", "Yes", "No", "No", "No"),
g2 = c("Yes", "Yes", "No", "No", "No", "Yes"),
w = c(0.5, 1, 1.5, 0.5, 1, 1.5))
# Doesn't run
temp_means_by_groups_2 <- test_data %>%
group_by(across(all_of(c("g1", "g2")))) %>%
mutate(across(all_of(c("x1", "x2")), weighted.mean(w = w))) %>%
ungroup()
Or rolling my own function:
weighted_means <- function(x) {
sum(test_w * x) / sum(test_w)
}
w <- test_data$w
# Runs but gives wrong answers (not weighting the means)
temp_means_by_groups_3 <- test_data %>%
group_by(across(all_of(c("g1", "g2")))) %>%
mutate(across(all_of(c("x1", "x2")), weighted_means)) %>%
ungroup()
My ideal answer would be a fast-running solution that works in base R, to minimise dependencies. Actually, speed isn't quite the priority --- running a bit slow would be tolerable if memory usage were kept low enough that this didn't crash out.
My second favourite would be tidyverse, since I'm somewhat familiar with it and am using it elsewhere in my code. From searching around answers that seem relatively close to my goals, I see that data.table is often mentioned; I've never used that, so I would prefer not to get into it, but am open to persuasion.
The code I have inherited happens to be storing everything as separate objects: there's a (numeric) vector of weights, a list containing each of the grouping variables as separate factor objects, and a matrix that contains each of x variables as a column. But I'm happy to either group them together into a single data frame or pass them as separate objects to the code that does this, or whatever is most convenient.
In the returned object, whatever it is, I want the columns for each of the 'x' variables to have the same names as their input ones.