
I have a data.table that I would like to perform group-by operations on, using several different sets of grouping variables, including the null (empty) set.

A toy example:

library(data.table)
set.seed(1)
DT <- data.table(
        id = sample(c("US", "Other"), 25, replace = TRUE), 
        loc = sample(LETTERS[1:5], 25, replace = TRUE), 
        index = runif(25)
        )

I would like to find the sum of `index` over all combinations of the key variables, including the null set. The concept is analogous to "grouping sets" in Oracle SQL. Here is my current workaround:

rbind(
  DT[, list(id = "", loc = "", sindex = sum(index)), by = NULL],
  DT[, list(loc = "", sindex = sum(index)), by = "id"],
  DT[, list(id = "", sindex = sum(index)), by = "loc"],
  DT[, list(sindex = sum(index)), by = c("id", "loc")]
)[order(id, loc)]
       id loc      sindex
 1:           11.54218399
 2:         A  2.82172063
 3:         B  0.98639578
 4:         C  2.89149433
 5:         D  3.93292900
 6:         E  0.90964424
 7: Other      6.19514146
 8: Other   A  1.12107080
 9: Other   B  0.43809711
10: Other   C  2.80724742
11: Other   D  1.58392886
12: Other   E  0.24479728
13:    US      5.34704253
14:    US   A  1.70064983
15:    US   B  0.54829867
16:    US   C  0.08424691
17:    US   D  2.34900015
18:    US   E  0.66484697

Is there a preferred "data.table" way to accomplish this?

mlegge
  • If you really want the results in a `data.table`, what you do here is fine. If you're just going to look at the results, a tabular format, with your cross-classifying variables on the margins, is much better: `stable <- tapply(DT$index,list(DT$id,DT$loc),sum); mstable <- rbind(cbind(stable,apply(stable,1,sum)),c(apply(stable,2,sum),sum(stable)))`. By the way, please use `set.seed` when generating random data for an example. – Frank Mar 06 '15 at 00:31
  • Oh, it's actually simpler than that, since `addmargins` works: `addmargins(stable)` – Frank Mar 06 '15 at 00:34
  • [Oracle GROUPING SETS](https://oracle-base.com/articles/misc/rollup-cube-grouping-functions-and-grouping-sets#grouping_sets) could be implemented as high-level data.table functions, no internals required, similar to what you did here. I recommend filing a feature request for it. – jangorecki Jul 18 '15 at 12:12
  • Just for the record, there is an open FR [data.table#1377](https://github.com/Rdatatable/data.table/issues/1377). You can also find a `rollup` generic function example in this [SO answer](http://stackoverflow.com/a/32938770/2490497) and a slightly more generic [rollup.data.table](https://github.com/jangorecki/data.cube/blob/dba1ba8ca4203021ecc2cd4442d58c142a40fc76/R/rollup.R#L11) method. They still don't directly answer *grouping sets*, so I'm just putting this in a comment. – jangorecki Dec 13 '15 at 00:09
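
For reference, here is the tabular approach from Frank's comments as a runnable block, using the `DT` defined above (`tapply` and `addmargins` ship with base R's stats package):

# cross-tabulate the sum of index by id and loc ...
stable <- tapply(DT$index, list(DT$id, DT$loc), sum)
# ... then add row, column and grand totals on the margins
addmargins(stable)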

3 Answers


As of this commit, this is now possible with the development version of data.table, using `cube` or `groupingsets`:

library("data.table")
# data.table 1.10.5 IN DEVELOPMENT built 2017-08-08 18:31:51 UTC
#   The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
#   Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
#   Release notes, videos and slides: http://r-datatable.com

cube(DT, list(sindex = sum(index)), by = c("id", "loc"))
#        id loc      sindex
#  1:    US   B  0.54829867
#  2:    US   A  1.70064983
#  3: Other   B  0.43809711
#  4: Other   E  0.24479728
#  5: Other   C  2.80724742
#  6: Other   A  1.12107080
#  7:    US   E  0.66484697
#  8:    US   D  2.34900015
#  9: Other   D  1.58392886
# 10:    US   C  0.08424691
# 11:    NA   B  0.98639578
# 12:    NA   A  2.82172063
# 13:    NA   E  0.90964424
# 14:    NA   C  2.89149433
# 15:    NA   D  3.93292900
# 16:    US  NA  5.34704253
# 17: Other  NA  6.19514146
# 18:    NA  NA 11.54218399

groupingsets(DT, j = list(sindex = sum(index)), by = c("id", "loc"), sets = list(character(), "id", "loc", c("id", "loc")))
#        id loc      sindex
#  1:    NA  NA 11.54218399
#  2:    US  NA  5.34704253
#  3: Other  NA  6.19514146
#  4:    NA   B  0.98639578
#  5:    NA   A  2.82172063
#  6:    NA   E  0.90964424
#  7:    NA   C  2.89149433
#  8:    NA   D  3.93292900
#  9:    US   B  0.54829867
# 10:    US   A  1.70064983
# 11: Other   B  0.43809711
# 12: Other   E  0.24479728
# 13: Other   C  2.80724742
# 14: Other   A  1.12107080
# 15:    US   E  0.66484697
# 16:    US   D  2.34900015
# 17: Other   D  1.58392886
# 18:    US   C  0.08424691
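
If you want the result in the same shape as the workaround in the question (empty strings rather than NA for the rolled-up levels, ordered by id and loc), a small post-processing sketch:

res <- cube(DT, list(sindex = sum(index)), by = c("id", "loc"))
# the rolled-up ("all") levels come back as NA; blank them out to match the workaround
res[is.na(id),  id  := ""]
res[is.na(loc), loc := ""]
res[order(id, loc)]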
mlegge

I have a generic function that you can feed a data.table and a vector of the dimension (grouping) column names, and it will return the sum of all numeric columns grouped by those dimensions.

rollSum = function(input, dimensions) {
  # cast the dimension columns to character in case one of them is numeric
  for (x in seq_along(dimensions)) {
    input[[dimensions[x]]] = as.character(input[[dimensions[x]]])
  }
  # the remaining integer/numeric columns are the ones to aggregate
  numericColumns = which(lapply(input, class) %in% c("integer", "numeric"))
  # sum every numeric column within each combination of the dimension columns
  output = input[, lapply(.SD, sum, na.rm = TRUE),
                 by = eval(dimensions), .SDcols = numericColumns]
  return(output)
}

You can then create a list of the different group-by vectors:

groupings = list(c("id"),c("loc"),c("id","loc"))

Then apply it over that list with lapply and combine the results with rbindlist:

groupedSets = rbindlist(lapply(groupings, function(x) {
  return(rollSum(DT, x))
}), fill = TRUE)
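
Note that groupings above does not include the null set (the grand total). A sketch of appending it with the same fill = TRUE idea, reusing DT from the question:

# add the overall total as an extra row; id and loc are filled with NA
rbindlist(list(
  groupedSets,
  DT[, lapply(.SD, sum, na.rm = TRUE), .SDcols = "index"]
), fill = TRUE)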

Using dplyr, an adaptation of this should work, if I understand your question correctly:

library(dplyr)

sum <- mtcars %>%
  group_by(vs, am) %>%
  summarise(Sum = sum(mpg))

I didn't check how it treats the missing values though, but it should just make another group of them (the last group).
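
To cover all of the grouping sets from the question with dplyr, one option is to bind the individual summaries together. A sketch reusing DT from the question; bind_rows() fills the columns a given summary lacks with NA:

library(dplyr)

df <- as.data.frame(DT)  # work on a plain data.frame copy

bind_rows(
  summarise(df, sindex = sum(index)),                       # null set: grand total
  df %>% group_by(id) %>% summarise(sindex = sum(index)),
  df %>% group_by(loc) %>% summarise(sindex = sum(index)),
  df %>% group_by(id, loc) %>% summarise(sindex = sum(index))
) %>% arrange(id, loc)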

yoland