Aggregating sub totals and grand totals with data.table

Question

I've got a data.table in R:

library(data.table)
set.seed(1)
DT = data.table(
  group=sample(letters[1:2],100,replace=TRUE), 
  year=sample(2010:2012,100,replace=TRUE),
  v=runif(100))

Aggregating this data into a summary table by group and year is simple and elegant:

table <- DT[,mean(v),by='group, year']

However, aggregating this data into a summary table, including subtotals and grand totals, is a little more difficult, and a lot less elegant:

library(plyr)
yearTot <- DT[,list(mean(v),year='Total'),by='group']
groupTot <- DT[,list(mean(v),group='Total'),by='year']
Tot <- DT[,list(mean(v), year='Total', group='Total')]
table <- rbind.fill(table,yearTot,groupTot,Tot)
table$group[table$group==1] <- 'Total'
table$year[table$year==1] <- 'Total'

This yields:

table[order(table$group, table$year), ]

Is there a simple way to specify subtotals and grand totals with data.table, such as the margins=TRUE command for plyr? I would prefer to use data.table over plyr on my dataset, as it is a very large dataset that I already have in the data.table format.

jangorecki · Accepted Answer · 2017-08-18T15:04:32.870

17

In recent devel data.table you can use new feature called "grouping sets" to produce sub totals:

library(data.table)
set.seed(1)
DT = data.table(
    group=sample(letters[1:2],100,replace=TRUE), 
    year=sample(2010:2012,100,replace=TRUE),
    v=runif(100))

cube(DT, mean(v), by=c("group","year"))
#    group year        V1
# 1:     a 2011 0.4176346
# 2:     b 2010 0.5231845
# 3:     b 2012 0.4306871
# 4:     b 2011 0.4997119
# 5:     a 2012 0.4227796
# 6:     a 2010 0.2926945
# 7:    NA 2011 0.4463616
# 8:    NA 2010 0.4278093
# 9:    NA 2012 0.4271160
#10:     a   NA 0.3901875
#11:     b   NA 0.4835788
#12:    NA   NA 0.4350153
cube(DT, mean(v), by=c("group","year"), id=TRUE)
#    grouping group year        V1
# 1:        0     a 2011 0.4176346
# 2:        0     b 2010 0.5231845
# 3:        0     b 2012 0.4306871
# 4:        0     b 2011 0.4997119
# 5:        0     a 2012 0.4227796
# 6:        0     a 2010 0.2926945
# 7:        2    NA 2011 0.4463616
# 8:        2    NA 2010 0.4278093
# 9:        2    NA 2012 0.4271160
#10:        1     a   NA 0.3901875
#11:        1     b   NA 0.4835788
#12:        3    NA   NA 0.4350153

edited Aug 18 '17 at 15:04

answered Aug 18 '17 at 14:29

jangorecki

16,384
4
79
160

Can you add more detail about the cube function? I can't seem to find any info from ??cube and my R cannot find that function. – DVL Jan 22 '18 at 23:17
1

@DVL all grouping sets functions are available under same manual page and `?cube` should result to this page. You will find examples there. Online version: https://rdatatable.gitlab.io/data.table/library/data.table/html/groupingsets.html – jangorecki Jan 30 '18 at 14:03
@jangorecki: Can it be that this version of data.table is not yet on CRAN? The linke you provided states data.table version 1.10.5; I just updated data.table from CRAN, but only to version 1.10.4-3 – Julian Feb 22 '18 at 15:53
@Julian 1.10.5 is not yet on CRAN, data.table is quite actively developed. You can install from packages sources using `install.packages("data.table", type = "source", repos = "http://Rdatatable.github.io/data.table")`. If you want binary package see [Installation](https://github.com/Rdatatable/data.table/wiki/Installation) wiki for details. – jangorecki Feb 24 '18 at 12:44
1

For anyone not following the comment dates closely, this is on CRAN since 1st May 2018. – Jozef Feb 01 '19 at 12:33

Matt Dowle · Answer 2 · 2012-03-20T00:56:38.573

11

I'm not aware of a simple way. Here's a first stab at an implementation. I don't know margins=TRUE in plyr, is this what that does?

crossby = function(DT, j, by) {
    j = substitute(j)
    ans = rbind(
        DT[,eval(j),by],
        DT[,list("Total",eval(j)),by=by[1]],
        cbind("Total",DT[,eval(j),by=by[2]]),
        list("Total","Total",DT[,eval(j)]),
        use.names=FALSE
        # 'use.names' argument added in data.table v1.8.0
    )
    setkeyv(ans,by)
    ans
}

crossby(DT, mean(v), c("group","year"))

      group  year        V1
 [1,]     a  2010 0.2926945
 [2,]     a  2011 0.4176346
 [3,]     a  2012 0.4227796
 [4,]     a Total 0.3901875
 [5,]     b  2010 0.5231845
 [6,]     b  2011 0.4997119
 [7,]     b  2012 0.4306871
 [8,]     b Total 0.4835788
 [9,] Total  2010 0.4278093
[10,] Total  2011 0.4463616
[11,] Total  2012 0.4271160
[12,] Total Total 0.4350153

edited Mar 20 '12 at 00:56

answered Feb 16 '12 at 18:47

Matt Dowle

58,872
22
166
224

2

The tables package seems to work with `DT` although its not clear if its actually using `data.table` operations underneath -- `library(tables); tabular(group + 1 ~ (factor(year) + 1) * v * mean, data = DT)` – G. Grothendieck Feb 16 '12 at 20:19
Matthew, would you please implement something like this directly into data.table? – Michael Apr 17 '13 at 00:29
@Michael Do you mean you'd like me to add `crossby`, as shown above, to the `data.table` package as a new function? Or would you like a `margin` argument added to `[.data.table`? I'm not fully clear. – Matt Dowle Apr 17 '13 at 08:57
I meant a `margin` argument to data.table – Michael Apr 19 '13 at 01:44
3

@Michael. Ok. I've now filed [#2695](https://r-forge.r-project.org/tracker/?group_id=240&atid=978&func=detail&aid=2695) to add that (or similar). Thanks. – Matt Dowle Apr 19 '13 at 11:49
3

@MatthewDowle I don't know if this is because the above code is for older version of `data.table`, but I had to make a few changes to get the above code to work - I had to wrap `by[1]` and `by[2]` in `c()`, I had to add `.SD` to the 2nd `eval` and I had to make sure that `year` is a character column in `DT` (otherwise the `rbind` would convert `Total` to `NA`) – eddi Jun 14 '13 at 15:13
speaking of `eval` and FR's - can we maybe have `eval` default to `.SD` on its own in complicated expressions? – eddi Jun 14 '13 at 15:15
@eddi That's a good idea. Yes please file it. Sorry, not sure why the `c()` is need. It probably is that `data.table` has changed in the last year since I answered here, but I'd hope something would be in NEWS to log the change. – Matt Dowle Jun 14 '13 at 17:10
@MattDowle furthermore, The current key setting method puts "Total" before "a" and "b", because it starts with a capitol "T". Was this change intended? – Michele Oct 04 '13 at 15:34

score 5 · Answer 3 · answered Jul 18 '14 at 15:03

See below for a solution - similar to @MattDowle's above - that takes any number of groups.

crossby2 <- function(data, j, by, grand.total = T, total.label = "(all)", value.label = "value") {
  j = substitute(j)

  # Calculate by each group
  lst <- lapply(1:length(by), function(i) {
    x <- data[, list(..VALUE.. = eval(j)), by = eval(by[1:i])]
    if (i != length(by)) x[, (by[-(1:i)]) := total.label]
    return(x)
  })

  # Grand total
  if (grand.total) lst <- c(lst, list(data[, list(..VALUE.. = eval(j))][, (by) := total.label]))

  # Combine all tables
  res <- rbindlist(lst, use.names = T, fill = F)

  # Change value column name
  setnames(res, "..VALUE..", value.label)

  # Set proper column order
  setcolorder(res, c(by, value.label))

  # Sort values
  setkeyv(res, by)

  return(res)
}

jangorecki · Answer 4 · 2015-10-05T22:19:39.143

Using current answers I've added support for multiple measures and aggregate functions and can add aggregation level indicator.

#' @title SQL's ROLLUP function
#' @description Returns data.table of aggregates value for each level of hierarchy provided in `by`.
#' @param x data.table input data.
#' @param j expression to evaluate in `j`, support multiple measures.
#' @param by character a hierarchy level for aggregations.
#' @param level logical, use `TRUE` to add `level` column of sub-aggregation.
#' @seealso [postgres: GROUPING SETS, CUBE, and ROLLUP](http://www.postgresql.org/docs/9.5/static/queries-table-expressions.html#QUERIES-GROUPING-SETS), [SO: Aggregating sub totals and grand totals with data.table](http://stackoverflow.com/a/24828162/2490497)
#' @return data.table
#' @examples 
#' set.seed(1)
#' x = data.table(group=sample(letters[1:2],100,replace=TRUE),
#'                year=sample(2010:2012,100,replace=TRUE),
#'                v=runif(100))
#' rollup(x, .(vmean=mean(v), vsum=sum(v)), by = c("group","year"))
library(data.table)
rollup = function(x, j, by, level=FALSE){
    stopifnot(is.data.table(x), is.character(by), length(by) >= 2L, is.logical(level))
    j = substitute(j)
    aggrs = rbindlist(c(
        lapply(1:(length(by)-1L), function(i) x[, eval(j), c(by[1:i])][, (by[-(1:i)]) := NA]), # subtotals
        list(x[, eval(j), c(by)]), # leafs aggregations
        list(x[, eval(j)][, c(by) := NA]) # grand total
    ), use.names = TRUE, fill = FALSE)
    if(level) aggrs[, c("level") := sum(sapply(.SD, is.na)), 1:nrow(aggrs), .SDcols = by]
    setcolorder(aggrs, neworder = c(by, names(aggrs)[!names(aggrs) %in% by]))
    setorderv(aggrs, cols = by, order=1L, na.last=TRUE)
    return(aggrs[])
}
set.seed(1)
x = data.table(group=sample(letters[1:2],100,replace=TRUE),
               year=sample(2010:2012,100,replace=TRUE),
               month=sample(1:12,100,replace=TRUE),
               v=runif(100))
rollup(x, .(vmean=mean(v), vsum=sum(v)), by = c("group","year","month"), level=TRUE)

nope, would be nice to have something like this in data.table, I've added FR to see if there is a chance for speedup such operations in C, if it will not be possible then maybe such wrappers can be proposed in PR. — jangorecki, Oct 05 '15 at 22:28
@DavidArenburg Since now it is. It is much more precise as it allows to calculate only chosen aggregates levels skipping the rest. The `levels` argument takes integer vector indicating various levels of aggregate. You can find the function here [rollup.R#L11](https://gitlab.com/jangorecki/data.cube/blob/v0.2/R/rollup.R#L11). — jangorecki, Nov 19 '15 at 01:26

score 1 · Answer 5 · answered Jul 27 '17 at 04:50

Borrowing from this answer (https://stackoverflow.com/a/39536828/4241780), the below provides an all-subsets summary (unlike crossby2, and rollup which appear to miss rows 9 to 11 of the OP's desired output). This function is expandable to any number of by or aggregate variables, although in its current state only allows one type of aggregation function. Great for calculating row substotals by group interactions (what I used it for).

add_col_sums.data.table <- function(data, aggvars, byvars, FUN = sum, level = "level") {

  # Find all possible subsets of your data
  subsets <- lapply(0:length(byvars), combn, x = byvars, simplify = FALSE)
  subsets <- do.call(c, subsets)

  # Calculate summary value by each subset
  agg_values <- lapply(subsets, function(x) 
    data[,lapply(.SD, FUN), by = x, .SDcols = aggvars])

  # Pull them all into one dataframe
  dat_out <- rbindlist(agg_values, fill = TRUE)

  # Order columns and rows
  setorderv(dat_out, byvars, na.last = TRUE)
  setcolorder(dat_out, c(byvars, aggvars))

  # Add level indication
  dat_out[, c(level) := Reduce("+", lapply(.SD, is.na))]

  # Return data.table
  dat_out[]

}

add_col_sums.data.table(DT, "v", c("group", "year"), FUN = mean)

Very nice solution. Worked first time, unlike the others on a many column set. — Chris, Feb 10 '18 at 03:00

Aggregating sub totals and grand totals with data.table

5 Answers5

Linked