
I would like to calculate grouped weighted means for each of several columns, but 'in place', by which I mean ending up with the same number of rows as I started with, not a summary: if two rows belong to the same group, each should carry the same (repeated) weighted averages, rather than being collapsed into one row that represents them both.

I have a version that works in base R, but on my actual large-ish dataset it is very slow (and beyond a certain size it seems to crash without producing a result, I think from running out of memory):

# Some dummy data

test_w <- c(0.5, 1, 1.5, 0.5, 1, 1.5)
test_g <- list(g1 = c("Yes", "Yes", "Yes", "No", "No", "No"),
               g2 = c("Yes", "Yes", "No",  "No", "No", "Yes"))
test_x <- matrix(c(1,  2, 3, 4, 5, 6,
                   10, 9, 8, 7, 6, 5),
                 nrow = 6,
                 dimnames = list(rows = c(),
                                 cols = c("x1", "x2")))

# Gives desired answers:
temp_means_by_groups_1 <- apply(
  test_x, 2,
  FUN = function(x) ave(test_w * x, test_g, FUN = sum) /
    ave(test_w, test_g, FUN = sum))
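
This produces a six-row matrix in which rows from the same (g1, g2) group repeat the same weighted means (values worked out from the dummy data above; exact print formatting may differ):

temp_means_by_groups_1
#            x1       x2
# [1,] 1.666667 9.333333
# [2,] 1.666667 9.333333
# [3,] 3.000000 8.000000
# [4,] 4.666667 6.333333
# [5,] 4.666667 6.333333
# [6,] 6.000000 5.000000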

My actual dataset has about 40 'x' columns and about 10,000 rows.
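
For timing purposes, dummy data at roughly that scale can be generated like this (a sketch; the object names and random values here are made up for illustration):

set.seed(1)
n <- 10000
big_x <- matrix(rnorm(n * 40), nrow = n,
                dimnames = list(rows = c(), cols = paste0("x", 1:40)))
big_g <- list(g1 = sample(c("Yes", "No"), n, replace = TRUE),
              g2 = sample(c("Yes", "No"), n, replace = TRUE))
big_w <- runif(n)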

I see from this SO answer that weighted.mean() doesn't play nicely with ave(): https://stackoverflow.com/a/38509589/4957167
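
To illustrate the problem (a minimal sketch using the dummy data above): ave() passes FUN only the group-wise subsets of its first argument, so there is no way to supply the matching subset of weights, and passing the full-length weight vector stops with "'x' and 'w' must have the same length":

try(ave(test_x[, "x1"], test_g,
        FUN = function(x) weighted.mean(x, w = test_w)))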

So I have tried to do something similar using dplyr / tidyverse:

# A data frame version of the dummy data

test_data <- data.frame(x1 = c(1,  2, 3, 4, 5, 6),
                        x2 = c(10, 9, 8, 7, 6, 5),
                        g1 = c("Yes", "Yes", "Yes", "No", "No", "No"),
                        g2 = c("Yes", "Yes", "No",  "No", "No", "Yes"),
                        w  = c(0.5, 1, 1.5, 0.5, 1, 1.5))

# Doesn't run
temp_means_by_groups_2 <- test_data %>%
  group_by(across(all_of(c("g1", "g2")))) %>%
  mutate(across(all_of(c("x1", "x2")), weighted.mean(w = w))) %>%
  ungroup()

Or rolling my own function:

weighted_means <- function(x) {
  sum(test_w * x) / sum(test_w)
}

w <- test_data$w

# Runs but gives wrong answers: the function closes over the full-length
# test_w, while x is only the group's subset, so the weights don't line up
temp_means_by_groups_3 <- test_data %>%
  group_by(across(all_of(c("g1", "g2")))) %>%
  mutate(across(all_of(c("x1", "x2")), weighted_means)) %>%
  ungroup()

My ideal answer would be a fast-running base R solution, to minimise dependencies. Actually, speed isn't quite the priority: running a bit slowly would be tolerable if memory usage stayed low enough not to crash.

My second favourite would be tidyverse, since I'm somewhat familiar with it and am using it elsewhere in my code. From searching around answers that seem close to my goal, I see that data.table is often mentioned; I've never used it, so I would prefer not to get into it, but I'm open to persuasion.

The code I have inherited happens to store everything as separate objects: a numeric vector of weights, a list containing each of the grouping variables as a separate factor, and a matrix containing each of the x variables as a column. But I'm happy to either group them into a single data frame or pass them as separate objects to the code that does this, or whatever is most convenient.

In the returned object, whatever it is, I want the columns for each of the 'x' variables to have the same names as their input ones.

jamse
2 Answers


A minor tweak to your code works. In your version, weighted.mean(w = w) is a call that gets evaluated immediately (with no x supplied), but across() needs a function; rewriting it as the lambda ~ weighted.mean(., w = w) applies weighted.mean() to each selected column, using the group's weights. Does this produce the result you're looking for?

library(dplyr, warn.conflicts = FALSE)

test_data <- data.frame(x1 = c(1,  2, 3, 4, 5, 6),
                        x2 = c(10, 9, 8, 7, 6, 5),
                        g1 = c("Yes", "Yes", "Yes", "No", "No", "No"),
                        g2 = c("Yes", "Yes", "No",  "No", "No", "Yes"),
                        w  = c(0.5, 1, 1.5, 0.5, 1, 1.5))

# Now runs
temp_means_by_groups_2 <- test_data %>%
  group_by(across(all_of(c("g1", "g2")))) %>%
  mutate(across(all_of(c("x1", "x2")), ~ weighted.mean(., w = w))) %>%
  ungroup()

temp_means_by_groups_2
#> # A tibble: 6 x 5
#>      x1    x2 g1    g2        w
#>   <dbl> <dbl> <chr> <chr> <dbl>
#> 1  1.67  9.33 Yes   Yes     0.5
#> 2  1.67  9.33 Yes   Yes     1  
#> 3  3     8    Yes   No      1.5
#> 4  4.67  6.33 No    No      0.5
#> 5  4.67  6.33 No    No      1  
#> 6  6     5    No    Yes     1.5

Created on 2021-07-12 by the reprex package (v2.0.0)

And here's a dtplyr version:

library(dplyr, warn.conflicts = FALSE)
library(dtplyr)
library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last

test_data <- data.frame(x1 = c(1,  2, 3, 4, 5, 6),
                        x2 = c(10, 9, 8, 7, 6, 5),
                        g1 = c("Yes", "Yes", "Yes", "No", "No", "No"),
                        g2 = c("Yes", "Yes", "No",  "No", "No", "Yes"),
                        w  = c(0.5, 1, 1.5, 0.5, 1, 1.5)) %>%
  as.data.table() %>%
  lazy_dt(immutable = FALSE)

# Now runs
temp_means_by_groups_2 <- test_data %>%
  group_by(across(all_of(c("g1", "g2")))) %>%
  mutate(across(all_of(c("x1", "x2")), ~ weighted.mean(., w = w))) %>%
  ungroup()

temp_means_by_groups_2
#> Source: local data table [6 x 5]
#> Call:   `_DT1`[, `:=`(x1 = weighted.mean(x1, w = w), x2 = weighted.mean(x2, 
#>     w = w)), by = .(g1, g2)]
#> 
#>      x1    x2 g1    g2        w
#>   <dbl> <dbl> <chr> <chr> <dbl>
#> 1  1.67  9.33 Yes   Yes     0.5
#> 2  1.67  9.33 Yes   Yes     1  
#> 3  3     8    Yes   No      1.5
#> 4  4.67  6.33 No    No      0.5
#> 5  4.67  6.33 No    No      1  
#> 6  6     5    No    Yes     1.5
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

Created on 2021-07-13 by the reprex package (v2.0.0)

Ian Gow
  • Thanks for this. Yes, that seems to be producing the right results. And it runs nice and quickly, even with my real dataset. – jamse Jul 13 '21 at 10:01
  • Thanks. You might want to accept an answer if one meets your needs. An easy path into `data.table` might be via `dtplyr` (you can test without learning new syntax, at least initially). Though I suspect that for this task there's not much upside (you're doing the same amount of calculation either way). I added an illustration to my answer. – Ian Gow Jul 13 '21 at 20:47
  • Because of the inherited codebase I'm working on, my ideal answer would still be a lower-memory-use (ideally a bit quicker) base R version, so I'll leave this open for a little while to see if anyone can suggest one. But wanted to acknowledge that this dplyr solution does do what I need and does it quickly/efficiently. – jamse Jul 14 '21 at 10:46

data.table is usually faster for larger datasets, so you could try it.

library(data.table)

cols <- c('x1', 'x2')
setDT(test_data)
# Replace each x column with its group-wise weighted mean, by reference
test_data[, (cols) := lapply(.SD, weighted.mean, w = w), .(g1, g2), .SDcols = cols]

#         x1       x2  g1  g2   w
#1: 1.666667 9.333333 Yes Yes 0.5
#2: 1.666667 9.333333 Yes Yes 1.0
#3: 3.000000 8.000000 Yes  No 1.5
#4: 4.666667 6.333333  No  No 0.5
#5: 4.666667 6.333333  No  No 1.0
#6: 6.000000 5.000000  No Yes 1.5
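
Note that `setDT()` and `:=` modify test_data in place (by reference) rather than making a copy, which should help with the memory concerns mentioned in the question.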

In base R, you can use split with lapply -

# Compute the weighted means within each group's sub-data-frame
# (columns 1:2 are x1 and x2), then stitch the pieces back together
do.call(rbind, lapply(split(test_data, test_data[c('g1', 'g2')]), function(x) {
  x[1:2] <- lapply(x[1:2], weighted.mean, w = x$w)
  x
})) -> test_data

Or by -

# Same idea with by(), which also splits on the group combinations
do.call(rbind, by(test_data, test_data[c('g1', 'g2')], function(x) {
  x[1:2] <- lapply(x[1:2], weighted.mean, w = x$w)
  x
})) -> test_data
Ronak Shah
  • Thanks for these options. For info, with my real dataset the `split()` version crashed after 3+ hours having used up all the memory, while the `by()` version seemed to run OK, taking about 15 minutes. The `dplyr::mutate()` version by @Ian Gow runs in about 10 seconds. So using `split` is a non-starter but `by` looks like it might be feasible: much slower than dplyr, but avoids dependencies, so a helpful option to have. Thanks also for the data.table version, which I've not checked but is handy to have as an option available. – jamse Jul 13 '21 at 09:50
  • Ah, there seems to be a downside to the `by()` method, in that it seems to reorder the rows. I guess that it would be possible to add an indicator column and re-sort based on that afterwards, though at the cost of an extra step, of course. – jamse Jul 13 '21 at 10:18
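
A minimal sketch of the re-sort idea from that last comment, starting from the original test_data data frame (the index column name orig_row is made up for illustration):

test_data$orig_row <- seq_len(nrow(test_data))
result <- do.call(rbind, by(test_data, test_data[c('g1', 'g2')], function(x) {
  x[1:2] <- lapply(x[1:2], weighted.mean, w = x$w)
  x
}))
# Restore the original row order, then drop the helper column
result <- result[order(result$orig_row), ]
result$orig_row <- NULL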