194

From a data frame, is there a easy way to aggregate (sum, mean, max etc) multiple variables simultaneously?

Below are some sample data:

library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05)) 
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)

I would like to simultaneously aggregate the x1 and x2 variables from the df2 data frame by year and month. The following code aggregates the x1 variable, but is it also possible to simultaneously aggregate the x2 variable?

### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)
zephryl
  • 14,633
  • 3
  • 11
  • 30
MikeTP
  • 7,716
  • 16
  • 44
  • 57

10 Answers10

211

Yes, in your formula, you can cbind the numeric variables to be aggregated:

aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
   year month         x1          x2
1  2000     1   7.862002   -7.469298
2  2001     1 276.758209  474.384252
3  2000     2  13.122369 -128.122613
...
23 2000    12  63.436507  449.794454
24 2001    12 999.472226  922.726589

See ?aggregate, the formula argument and the examples.

Henrik
  • 65,555
  • 14
  • 143
  • 159
Andrie
  • 176,377
  • 47
  • 447
  • 496
  • 3
    Is it possible for the cbind to use dynamic variables? – pdb Nov 13 '15 at 05:29
  • 15
    It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting. – pdb Nov 13 '15 at 06:19
  • 2
    what if I instead of x1 and x2 I want to use all the remaining variables (other than year, month) – Clock Slave Mar 16 '16 at 11:22
  • 11
    @ClockSlave, then you need to just use `.` on the LHS. `aggregate(. ~ year + month, df1, sum, na.rm = TRUE)`. In this example, `sum` for "date" doesn't make sense though.... – A5C1D2H2I1M1N2O1R2T1 Mar 21 '16 at 03:53
  • 5
    What if I don't want two variables but two functions?. For example mean and sd. – skan Apr 14 '16 at 19:15
  • 1
    In the case of `NA`s this approach is really problematic. Setting `na.rm = TRUE` does not affect anything and the `NA` cases are ignored... – DatamineR Jun 23 '17 at 16:03
  • 1
    @andrie. The use of `.` in the formula interface mentioned recently in the comments is probably worth adding to the answer. – lmo Jul 13 '17 at 02:05
  • Is there a way to perform different functions (e.g., `mean`, `max`, `min` ,etc.) to each of the different variables in `cbind`? – theforestecologist Apr 30 '18 at 18:50
  • can i use a vector of names in `cbind`.? `cbind(x1,x2,x3)` works. But `x=c(x1,x2,x3); cbind(x)~` gives `Variable lengths differ` error – sjd Jul 10 '20 at 11:33
64

With the dplyr package, you can use across() to aggregate multiple variables using tidyselect language. For the example dataset you can do this as follows:

library(dplyr)
set.seed(13)

# summarising all non-grouping variables
df1 %>% group_by(year, month) %>% summarise(across(everything(), n_distinct))

# summarising a specific set of non-grouping variables
df1 %>% group_by(year, month) %>% summarise(across(x1:x2, sum))
df1 %>% group_by(year, month) %>% summarise(across(c(x1, x2), sum))
df1 %>% group_by(year, month) %>% summarise(across(-date, sum))

# summarising a specific set of non-grouping variables using selection helpers:
df1 %>% group_by(year, month) %>% summarise(across(starts_with('x'), sum))
df1 %>% group_by(year, month) %>% summarise(across(matches('.*[0-9]'), sum))

# summarising a specific set of non-grouping variables based on condition (class)
df1 %>% group_by(year, month) %>% summarise(across(where(is.numeric), sum))

All but the first of these result in:

# A tibble: 24 × 4
# Groups:   year [2]
    year month     x1    x2
   <dbl> <dbl>  <dbl> <dbl>
 1  2000     1  131.   27.4
 2  2000     2   44.8 155. 
 3  2000     3   60.7 207. 
 4  2000     4  -11.5 379. 
 5  2000     5   64.0 441. 
 6  2000     6  -16.5 517. 
 7  2000     7  210.  530. 
 8  2000     8  112.  573. 
 9  2000     9 -129.  347. 
10  2000    10 -165.  444. 
# … with 14 more rows

You can also apply multiple functions to the selected columns:

df1 %>% 
  group_by(year, month) %>% 
  summarise(across(x1:x2, list(sum = sum, avg = mean)))
# A tibble: 24 × 6
# Groups:   year [2]
    year month x1_sum x1_avg x2_sum x2_avg
   <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1  2000     1  131.   4.24    27.4  0.884
 2  2000     2   44.8  1.54   155.   5.34 
 3  2000     3   60.7  1.96   207.   6.69 
 4  2000     4  -11.5 -0.385  379.  12.6  
 5  2000     5   64.0  2.06   441.  14.2  
 6  2000     6  -16.5 -0.550  517.  17.2  
 7  2000     7  210.   6.76   530.  17.1  
 8  2000     8  112.   3.60   573.  18.5  
 9  2000     9 -129.  -4.30   347.  11.6  
10  2000    10 -165.  -5.33   444.  14.3  
# … with 14 more rows

A few final notes:

  • By default, summarise() drops the last level of grouping, so all the examples above would still be grouped by year. To drop all grouping, you can add an ungroup() call, or set .groups = "drop" in the summarise() call.
  • As of dplyr 1.1.0, you can alternatively specify grouping for just the summarise operation using the .by argument, e.g., df1 %>% summarise(across(c(x1, x2), sum), .by = c(year, month))
  • across() also works with other dplyr verbs such as mutate() and reframe().
  • Prior to the introduction of across() with dplyr 1.0.0, these kinds of operations were done with summarise_all(), summarise_at(), summarise_if(), and (even earlier) by summarise_each(). These are now superseded or deprecated in favor of across().
zephryl
  • 14,633
  • 3
  • 11
  • 30
Jaap
  • 81,064
  • 34
  • 182
  • 193
58

Using the data.table package, which is fast (useful for larger datasets)

https://github.com/Rdatatable/data.table/wiki

library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by = .(year, month), .SDcols = c("x1","x2")]
setDF(df2) # convert back to dataframe

Using the plyr package

require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))

Using summarize() from the Hmisc package (column headings are messy in my example though)

# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))
Jaap
  • 81,064
  • 34
  • 182
  • 193
numbercruncher
  • 924
  • 6
  • 10
  • 1
    why not do this for data.table option: `dt[, .(x1.sum = sum(x1), x2.sum = sum(x2), by = c(year, month)` ? – Bulat Oct 13 '18 at 12:00
  • @Bulat because that scales really poorly to more columns. Your proposal is fine for 2 columns, but `.SDcols` is a good general solution that works just as well for 2 columns, 20 columns, or 200 columns. – Gregor Thomas Jun 09 '23 at 14:50
48

Where is this year() function from?

You could also use the reshape2 package for this task:

require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
#  year month         x1           x2
1  2000     1  -80.83405 -224.9540159
2  2000     2 -223.76331 -288.2418017
3  2000     3 -188.83930 -481.5601913
4  2000     4 -197.47797 -473.7137420
5  2000     5 -259.07928 -372.4563522
Striezel
  • 3,693
  • 7
  • 23
  • 37
EDi
  • 13,160
  • 2
  • 48
  • 57
  • 10
    The `recast` function (also from `reshape2`) integrates the `melt` and `dcast` function in one go for tasks like this: `recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))` – Jaap May 13 '16 at 06:17
22

Interestingly, base R aggregate's data.frame method is not showcased here, above the formula interface is used, so for completeness:

aggregate(
  x = df1[c("x1", "x2")],
  by = df1[c("year", "month")],
  FUN = sum, na.rm = TRUE
)

More generic use of aggregate's data.frame method:

Since we are providing a

  • data.frame as x and
  • a list (data.frame is also a list) as by, this is very useful if we need to use it in a dynamic manner, e.g. using other columns to be aggregated and to aggregate by is very simple
  • also with custom-made aggregation functions

For example like so:

colsToAggregate <- c("x1")
aggregateBy <- c("year", "month")
dummyaggfun <- function(v, na.rm = TRUE) {
  c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}

aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)
Jozef
  • 2,617
  • 14
  • 19
  • 2
    First, this is a criminally underrated answer, thanks for it. Second, do you think there is possible to generate an `aggregate` function that computes the `mean` for a given set of variables and `sum` for another set of variables? I am really curious about it, and it could save me a lot of time. Thx again. – Álvaro A. Gutiérrez-Vargas Dec 04 '20 at 17:10
9

With the dplyr version >= 1.0.0, we can also use summarise to apply function on multiple columns with across

library(dplyr)
df1 %>% 
    group_by(year, month) %>%
    summarise(across(starts_with('x'), sum))
# A tibble: 24 x 4
# Groups:   year [2]
#    year month     x1     x2
#   <dbl> <dbl>  <dbl>  <dbl>
# 1  2000     1   11.7  52.9 
# 2  2000     2  -74.1 126.  
# 3  2000     3 -132.  149.  
# 4  2000     4 -130.    4.12
# 5  2000     5  -91.6 -55.9 
# 6  2000     6  179.   73.7 
# 7  2000     7   95.0 409.  
# 8  2000     8  255.  283.  
# 9  2000     9  489.  331.  
#10  2000    10  719.  305.  
# … with 14 more rows
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    that is now on CRAN :) - however, there seem to be performance issues with using across - (which I can confirm on my data), here a related thread https://github.com/tidyverse/dplyr/issues/4953 – tjebo Sep 30 '20 at 21:51
4

For a more flexible and faster approach to data aggregation, check out the collap function in the collapse R package available on CRAN:

library(collapse)
# Simple aggregation with one function
head(collap(df1, x1 + x2 ~ year + month, fmean))

  year month        x1        x2
1 2000     1 -1.217984  4.008534
2 2000     2 -1.117777 11.460301
3 2000     3  5.552706  8.621904
4 2000     4  4.238889 22.382953
5 2000     5  3.124566 39.982799
6 2000     6 -1.415203 48.252283

# Customized: Aggregate columns with different functions
head(collap(df1, x1 + x2 ~ year + month, 
      custom = list(fmean = c("x1", "x2"), fmedian = "x2")))

  year month  fmean.x1  fmean.x2 fmedian.x2
1 2000     1 -1.217984  4.008534   3.266968
2 2000     2 -1.117777 11.460301  11.563387
3 2000     3  5.552706  8.621904   8.506329
4 2000     4  4.238889 22.382953  20.796205
5 2000     5  3.124566 39.982799  39.919145
6 2000     6 -1.415203 48.252283  48.653926

# You can also apply multiple functions to all columns
head(collap(df1, x1 + x2 ~ year + month, list(fmean, fmin, fmax)))

  year month  fmean.x1    fmin.x1  fmax.x1  fmean.x2   fmin.x2  fmax.x2
1 2000     1 -1.217984 -4.2460775 1.245649  4.008534 -1.720181 10.47825
2 2000     2 -1.117777 -5.0081858 3.330872 11.460301  9.111287 13.86184
3 2000     3  5.552706  0.1193369 9.464760  8.621904  6.807443 11.54485
4 2000     4  4.238889  0.8723805 8.627637 22.382953 11.515753 31.66365
5 2000     5  3.124566 -1.5985090 7.341478 39.982799 31.957653 46.13732
6 2000     6 -1.415203 -4.6072295 2.655084 48.252283 42.809211 52.31309

# When you do that, you can also return the data in a long format
head(collap(df1, x1 + x2 ~ year + month, list(fmean, fmin, fmax), return = "long"))

  Function year month        x1        x2
1    fmean 2000     1 -1.217984  4.008534
2    fmean 2000     2 -1.117777 11.460301
3    fmean 2000     3  5.552706  8.621904
4    fmean 2000     4  4.238889 22.382953
5    fmean 2000     5  3.124566 39.982799
6    fmean 2000     6 -1.415203 48.252283

Note: You can use base functions like mean, max etc. with collap, but fmean, fmax etc. are C++ based grouped functions offered in the collapse package which are significantly faster (i.e. the performance on large data aggregations is the same as data.table while providing greater flexibility, and these fast grouped functions can also be used without collap).

Note2: collap also supports flexible multitype data aggregation, which you can of course do using the custom argument, but you can also apply functions to numeric and non-numeric columns in a semi-automated way:

# wlddev is a data set of World Bank Indicators provided in the collapse package
head(wlddev)

      country iso3c       date year decade     region     income  OECD PCGDP LIFEEX GINI       ODA
1 Afghanistan   AFG 1961-01-01 1960   1960 South Asia Low income FALSE    NA 32.292   NA 114440000
2 Afghanistan   AFG 1962-01-01 1961   1960 South Asia Low income FALSE    NA 32.742   NA 233350000
3 Afghanistan   AFG 1963-01-01 1962   1960 South Asia Low income FALSE    NA 33.185   NA 114880000
4 Afghanistan   AFG 1964-01-01 1963   1960 South Asia Low income FALSE    NA 33.624   NA 236450000
5 Afghanistan   AFG 1965-01-01 1964   1960 South Asia Low income FALSE    NA 34.060   NA 302480000
6 Afghanistan   AFG 1966-01-01 1965   1960 South Asia Low income FALSE    NA 34.495   NA 370250000

# This aggregates the data, applying the mean to numeric and the statistical mode to categorical columns
head(collap(wlddev, ~ iso3c + decade, FUN = fmean, catFUN = fmode))

  country iso3c       date   year decade                     region      income  OECD    PCGDP   LIFEEX GINI      ODA
1   Aruba   ABW 1961-01-01 1962.5   1960 Latin America & Caribbean  High income FALSE       NA 66.58583   NA       NA
2   Aruba   ABW 1967-01-01 1970.0   1970 Latin America & Caribbean  High income FALSE       NA 69.14178   NA       NA
3   Aruba   ABW 1976-01-01 1980.0   1980 Latin America & Caribbean  High income FALSE       NA 72.17600   NA 33630000
4   Aruba   ABW 1987-01-01 1990.0   1990 Latin America & Caribbean  High income FALSE 23677.09 73.45356   NA 41563333
5   Aruba   ABW 1996-01-01 2000.0   2000 Latin America & Caribbean  High income FALSE 26766.93 73.85773   NA 19857000
6   Aruba   ABW 2007-01-01 2010.0   2010 Latin America & Caribbean  High income FALSE 25238.80 75.01078   NA       NA

# Note that by default (argument keep.col.order = TRUE) the column order is also preserved
Sebastian
  • 1,067
  • 7
  • 12
2

An updated dplyr solution: since dplyr 1.1.0, you can use .by in summarise to do an inline temporary grouping (which automatically ungroups after the computation).

Using across (available from dplyr 1.0.0) allows to use the same function for multiple columns at the same time.

library(dplyr)
df1 %>%
  summarise(across(starts_with('x'), sum), .by = c(year, month))

# A tibble: 24 x 4
#    year month     x1     x2
#   <dbl> <dbl>  <dbl>  <dbl>
# 1  2000     1   11.7  52.9 
# 2  2000     2  -74.1 126.  
# 3  2000     3 -132.  149.  
# 4  2000     4 -130.    4.12
# 5  2000     5  -91.6 -55.9 
# 6  2000     6  179.   73.7 
# 7  2000     7   95.0 409.  
# 8  2000     8  255.  283.  
# 9  2000     9  489.  331.  
#10  2000    10  719.  305.  
# … with 14 more rows
Maël
  • 45,206
  • 3
  • 29
  • 67
  • does not seem to work. Check mtcars %>% + summarise(across(starts_with('d'), sum), .by = cyl) – Hugues Jan 12 '23 at 12:15
  • This works with `dplyr` development version (not yet on CRAN). Check with `devtools::install_github('tidyverse/dplyr')` first – Maël Jan 12 '23 at 12:44
1

Below is another way to summarize multiple columns, especially useful when the function needs further arguments. You can select all columns via everything() or a subset of columns like any_of(c("a", "b")).

library(dplyr)
# toy data
df <- tibble(a = sample(c(NA, 5:7), 30, replace = TRUE), 
             b = sample(c(NA, 1:5), 30, replace = TRUE), 
             c = sample(1:5, 30, replace = TRUE), 
             grp = sample(1:3, 30, replace = TRUE))
df
#> # A tibble: 30 × 4
#>        a     b     c   grp
#>    <int> <int> <int> <int>
#>  1     7     1     3     1
#>  2     7     4     4     2
#>  3     5     1     3     3
#>  4     7    NA     3     2
#>  5     7     2     5     2
#>  6     7     4     4     2
#>  7     7    NA     3     3
#>  8    NA     5     4     1
#>  9     5     1     1     2
#> 10    NA     3     1     2
#> # … with 20 more rows
df %>% 
  group_by(grp) %>%
  summarise(across(everything(), 
                   list(mean = ~mean(., na.rm = TRUE),
                        q75 = ~quantile(., probs = .75, na.rm = TRUE))))
#> # A tibble: 3 × 7
#>     grp a_mean a_q75 b_mean b_q75 c_mean c_q75
#>   <int>  <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
#> 1     1   6.6      7   2.88  4.25   3        4
#> 2     2   6.33     7   2.62  3.25   2.9      4
#> 3     3   5.78     6   3.33  4      3.09     4
HBat
  • 4,873
  • 4
  • 39
  • 56
-1

Late to the party, but recently found another way to get the summary statistics.

library(psych) describe(data)

Will output: mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.

britt
  • 79
  • 10