3

This is an extension of the questions asked here: Aggregate / summarize multiple variables per group (e.g. sum, mean).

  • Specifically, if I have multiple variables to aggregate, is there a way to change the FUN each variable is aggregated by?

Example:

dat <- data.frame(ID = rep(letters[1:3], each =3), Plot = rep(1:3,3),Val1 = (1:9)*10, Val2 = (1:9)*20)

> dat
  ID Plot Val1 Val2
1  a    1   10   20
2  a    2   20   40
3  a    3   30   60
4  b    1   40   80
5  b    2   50  100
6  b    3   60  120
7  c    1   70  140
8  c    2   80  160
9  c    3   90  180


#Aggregate 2 variables using the *SAME* FUN
  aggregate(cbind(Val1, Val2) ~ ID, dat, sum)

  ID Val1 Val2
1  a   60  120
2  b  150  300
3  c  240  480
  • but notice that both variables are summed.

What if I want to take the sum of Val1 and the mean of Val2??

The best solution I have is:

merge(
  aggregate(Val1 ~ ID, dat, sum),
  aggregate(Val2 ~ ID, dat, mean),
  by = c('ID')
)
  • But I'm wondering if their is a cleaner/shorter way to go about doing this...

Can I do this all in Aggregate???

  • (I didn't see anything in the aggregate code that made it seem like this could work, but I've been wrong before...)

Example #2:

(as requested, using mtcars)
Reduce(function(df1, df2) merge(df1, df2, by = c('cyl','am'), all = T),
    list(
    aggregate(hp ~ cyl + am, mtcars, sum, na.rm = T),
    aggregate(wt ~ cyl + am, mtcars, min), 
    aggregate(qsec ~ cyl + am, mtcars, mean, na.rm = T),
    aggregate(mpg ~ cyl + am, mtcars, mean, na.rm = T)
  )
)

#I'd want a straightforward alternative like:
  aggregate(cbind(hp,wt,qsec,mpg) ~ cyl + am, mtcars, list(sum, min, mean, mean), na.rm = T)

  # ^(I know this doesn't work)

Note: I would prefer a base R approach, but I already realize dplyr or some other package probably does this "better"

theforestecologist
  • 4,667
  • 5
  • 54
  • 91
  • I'm sure there's a cleaner way + you'll still have to do some manipulating: `do.call('rbind', by(dat, dat$ID, FUN = function(x) data.frame(sum_v1 = sum(x$Val1), mean_v2 = mean(x$Val2))))` – bouncyball Apr 30 '18 at 19:13
  • Note: in reality, my data contains *dozens* of variables I want to aggregate with at least 3 different functions, so I'm looking for a solution that scales up effectively :). – theforestecologist Apr 30 '18 at 19:17
  • Re "dozens of variables", so you want to pass them programmatically rather than explicitly typing their names? Anyway, one clumsy variant of bouncy's approach: `do.call("rbind", with(dat, tapply(seq_len(nrow(dat)), ID, FUN = function(i) data.frame(sumV1 = sum(Val1[i]), meanV2 = mean(Val2[i])))))` – Frank Apr 30 '18 at 19:20
  • I can scale up to more than 2 data.frames using `Reduce` (*ala* [Simultaneously merge multiple data.frames in a list](https://stackoverflow.com/a/34393416/4581200)) – theforestecologist Apr 30 '18 at 19:39
  • "Dozens of variables... at least 3 different functions". You should make your example more reflective of your use case. Do you want to hand-code each variable/function pair? If so, standard `dplyr` or `data.table` should work fine. Or do you have, say, a vector of column names for `FUN1`, another for `FUN2`...? Or something else? Is it a 1-1 mapping or might you want the mean and sum and max of one column, and just the mean of another? Instead of making up data, just use `mtcars`. You can group by the `cyl` column and have lots of numeric columns to play with. – Gregor Thomas Apr 30 '18 at 19:47
  • Or, if you want to stick with `aggregate` and you only have 3 functions, just make 3 aggregate calls and `cbind` the results together. If you have the columns you want for each function in any sort of object (please share it if you do) you could make this programmatic pretty easily. – Gregor Thomas Apr 30 '18 at 19:49
  • @Gregor see my update – theforestecologist Apr 30 '18 at 20:03

2 Answers2

3

Consider pairwise mapping of columns and functions and then run Map to build a list of aggregated dataframes since aggregate allows string values of function names. Then run a Reduce to merge all dataframe elements together.

cols <- names(dat)[grep("Val", names(dat))]
fcts <- c("mean", "sum")

df_list <- Map(function(c, f) aggregate(.~ID, dat[c("ID", c)], FUN=f), cols, fcts)

final_df <- Reduce(function(x,y) merge(x, y, by="ID"), df_list)

final_df
#   ID Val1 Val2
# 1  a   20  120
# 2  b   50  300
# 3  c   80  480

Be sure columns and functions vectors are same length, possibly needing to repeat functions.

And to demonstrate with mtcars:

cols <- c("hp", "wt", "qsec", "mpg")
fcts <- c("sum", "min", "mean", "mean")

df_list <- Map(function(c, f) aggregate(.~cyl+am, mtcars[c("cyl", "am", c)], FUN=f), cols, fcts)

Reduce(function(x,y) merge(x,y, by=c("cyl", "am")), df_list)

#   cyl am   hp    wt     qsec      mpg
# 1   4  0  254 2.465 20.97000 22.90000
# 2   4  1  655 1.513 18.45000 28.07500
# 3   6  0  461 3.215 19.21500 19.12500
# 4   6  1  395 2.620 16.32667 20.56667
# 5   8  0 2330 3.435 17.14250 15.05000
# 6   8  1  599 3.170 14.55000 15.40000
Parfait
  • 104,375
  • 17
  • 94
  • 125
1

You can use summarise from the dplyr package

library(dplyr)

dat <- data.frame(ID = rep(letters[1:3], each =3), Plot = rep(1:3,3),Val1 = (1:9)*10, Val2 = (1:9)*20)
dat

#>   ID Plot Val1 Val2
#> 1  a    1   10   20
#> 2  a    2   20   40
#> 3  a    3   30   60
#> 4  b    1   40   80
#> 5  b    2   50  100
#> 6  b    3   60  120
#> 7  c    1   70  140
#> 8  c    2   80  160
#> 9  c    3   90  180

dat %>% 
  group_by(ID) %>% 
  summarise(sum_val1 = sum(Val1, na.rm = TRUE),
            mean_val2 = mean(Val2, na.rm = TRUE)) %>%
  ungroup()

#> # A tibble: 3 x 3
#>   ID    sum_val1 mean_val2
#>   <fct>    <dbl>     <dbl>
#> 1 a           60        40
#> 2 b          150       100
#> 3 c          240       160

Created on 2018-04-30 by the reprex package (v0.2.0).

Tung
  • 26,371
  • 7
  • 91
  • 115