0

I didn't see another question quite like this on SO. Let's say I have a dataframe of consisting of 1...m = M variables I want to aggregate over, plus two other grouping variables I plan to aggregate by (m+2 columns total). One of the variables in M needs to be averaged, and the other m-1 should be summed.

I'm familiar with various ways to aggregate data in R, including performing different functions on separate columns. For example, I like this data.table method:

df <- df[, list(w = sum(w), x = mean(x)), by = c("y", "z")]

which sums w and averages y by two groups, y,z.

After some research, it also looks like there's a dplyr function across(-x, sum) that can subject all variables meeting a condition to an aggregation function -- in this case, sum all variables other than x.

I'm having trouble figuring out how to use these together. In my case, I have > 1,000,000 rows and 200+ columns, so it's not feasible to cbind() everything on the LHS of an aggregate()-type function.

  • How might I apply multiple functions (mean, sum) with an across() technique? (My instinct is it might involve some merging.)
  • Is there a data.table solution? (Not necessary, but helpful for speed.)

MWE: Consider the iris dataset.

library(datasets)
data(iris)
summary(iris)

which has four numeric columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and one factor (Species).

I want to group by species, and within that group, aggregate each of the other four columns. However, Sepal.Length should be averaged (mean), while Sepal.Width, Petal.Length, Petal.Width should be summed (summed).

I can do this by naming each variable as follows:

library(data.table)
iris <- data.table(iris)
iris[, list(Sepal.Length = mean(Sepal.Length), 
            Sepal.Width = sum(Sepal.Width),
            Petal.Length = sum(Petal.Length),
            Petal.Width = sum(Petal.Width)), by = c("Species")]

How can I generalize the procedure so that it's only necessary to name the exceptions (mean(Sepal.Length))?

As an alternative to naming only the exceptions (in the manner of -Sepal.Length), I could also exhaustively name all the indices; eg:

iris[, list(iris[,1] = mean(iris[,1]), 
            iris[,2:4] = sum(iris[,2:4])), by = c(Species)] 
# doesn't work of course
beddotcom
  • 447
  • 3
  • 11
  • 2
    I would prefer to have a MWE to be more confident that I am interpreting your description properly, to save me time making up example data, and to verify that my solution works as you want it to. – Jon Spring Oct 17 '22 at 22:30
  • Let's say you have a group ("x1|y1") that has two rows and 250 columns of data. For that group, are you looking for an average for the two rows in one column and sums for each of the other columns, or a single sum for all the other columns? – Jon Spring Oct 17 '22 at 22:49
  • Relevant? https://stackoverflow.com/a/32734549/6851825 – Jon Spring Oct 17 '22 at 22:54
  • Sure! Just added a MWE for clarity – beddotcom Oct 17 '22 at 23:05
  • To answer your question: I want sums for each of the other columns, not a single sum. The result should have the same ncol. – beddotcom Oct 17 '22 at 23:12

2 Answers2

2

One of the great things about dplyr::across is that it doesn't have to be used in isolation.

your_data %>% 
  group_by(group1, group2) %>%
  summarize(across(-col_to_mean, sum), mean_col = mean(col_to_mean))

That works perfectly in my head, but if you want it tested please do provide a MWE.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Thanks! This solution works for one column exception. Is it possible to list more than one exception in "across()"? Also, is there a data.table solution for when speed is an issue? – beddotcom Oct 17 '22 at 23:15
  • Took 3 seconds to run this on this 1M x 250 col fake data: `set.seed(0); rows = 1E6; cols = 250; df1 <- matrix(rpois(rows*cols, 10), ncol = 250) %>% as.data.frame() %>% mutate(grp1 = sample(LETTERS, rows, replace = TRUE), grp2 = sample(letters, rows, replace = TRUE), .before = 1)`. That's fast enough for many use cases. Will likely be larger data.table advantage the more groups you have. You could add `dtplyr::lazy_dt() %>%` as the 2nd line to port to `data.table`. – Jon Spring Oct 17 '22 at 23:16
  • 1
    "data.table equivalent of group_by() %>% mutate(across())" https://stackoverflow.com/a/73315342/6851825 – Jon Spring Oct 17 '22 at 23:21
  • Your machine is faster than mine, Jon. Thanks for explaining the bottleneck and for linking a related question. The other question that occurs to me -- this solution is designed for one column exception, which is what I asked about. However, is it possible to list more than one exception? – beddotcom Oct 17 '22 at 23:21
  • 1
    `across(-c(one_exception, some:others)...)` – Jon Spring Oct 17 '22 at 23:22
  • You can use any of the [selection helpers](https://tidyselect.r-lib.org/reference/language.html) inside `across()`, matching on a name pattern, regex, position, etc. to either find the columns you want to work with or the ones you want to exclude. – Gregor Thomas Oct 17 '22 at 23:24
1

In data.table you could do something like this, note that lapply over one column for the mean here is a bit silly, but this is a pretty universal way if you want to apply different functions over a set of columns.

iris[, lapply(.SD[, 1], mean), Species][iris[, lapply(.SD[, 2:4], sum), Species], on = "Species"]

#       Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1:     setosa        5.006       171.4         73.1        12.3
# 2: versicolor        5.936       138.5        213.0        66.3
# 3:  virginica        6.588       148.7        277.6       101.3
Merijn van Tilborg
  • 5,452
  • 1
  • 7
  • 22