Sum multiple columns by group

Question

I have data which looks like this:

Time ColA ColB ColC
0    1    10   5
1    3    7    15
2    0    8    9
3    3    4    5
4    4    5    6
7    10   23   4

I'd like to group my data into time intervals of equal size and sum the variables of each column. This, for instance, would be the result of grouping the time by 2:

Time ColA ColB ColC
0    4    17   20
2    3    12   14
4    4    5    6
7    10   23   4

I could label the data by introducing a new column whose value is floor(data$Time/2), but it's unclear how to do the summations. Most of the packages I've looked at seem to summarise only a single column, whereas I would like to summarise all the columns.

A search on the words in the title should have provided many answers. — IRTFM, Aug 22 '13 at 18:57
Indeed, @DWin. It's unfortunate that in so many cases those answers are either too specific or lack explanation of the working parts. — Richard, Aug 22 '13 at 20:11
Not a duplicate, @Ferdinand.kraft. That answer seems to combine multiple columns into one, whereas I would like my columns to stay separate. — Richard, Aug 22 '13 at 20:13
If you want to apply the same function to all columns within groups, then `aggregate` is the base R method to use. — IRTFM, Aug 23 '13 at 04:15
It may be so, @DWin, but the `data.table` method explained in Andreas' answer is so well-commented and cleanly-written that it's the one I'll be using. — Richard, Aug 23 '13 at 15:11
Sorry, I didn't mean to imply that you should not use `data.table`. In fact, using data.table is probably a much better strategy in the long run. Some people are looking for a base R solution and aggregate fits the bill in this case. — IRTFM, Aug 23 '13 at 15:23

Andreas · Accepted Answer · 2013-08-22T18:23:56.047

7

Use the "data.table" package! Its syntax for grouped operations is compact, and it typically runs faster than the base R equivalents on large data.

### Load package (library() stops with an error if the package is missing)
library(data.table)

### Set up variables; Create data.table
time <- c(0:4, 7)
ColA <- c(1, 3, 0, 3, 4, 10)
ColB <- c(10, 7, 8, 4, 5, 23)
ColC <- c(5, 15, 9, 5, 6, 4)
data <- data.table(time, ColA, ColB, ColC)

### Determine which columns we want to apply the function to
sum.cols <- grep("Col", names(data), value = TRUE)

### Sum each column within each group
data[, lapply(.SD, sum), by = floor(time / 2), .SDcols = sum.cols]

### Output:
    floor ColA ColB ColC
1:     0    4   17   20
2:     1    3   12   14
3:     2    4    5    6
4:     3   10   23    4

Note that the symbol ".SD" stands for "Subset of Data": for each group, it holds that group's rows, restricted to the columns named in ".SDcols". Here, the lapply function iterates over the columns of .SD and applies the function "sum" to each one, so every column is summed separately for each level of our "floor" variable.
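If you would rather see the groups labelled in the original time units, one option is to name the grouping expression inside by and scale it back up. This is only a sketch: the names dt, width and Time are made up for the illustration, and the label is the start of each interval (so the last group shows 6, not 7).

```r
library(data.table)

# Same example data as in the question
dt <- data.table(
  time = c(0:4, 7),
  ColA = c(1, 3, 0, 3, 4, 10),
  ColB = c(10, 7, 8, 4, 5, 23),
  ColC = c(5, 15, 9, 5, 6, 4)
)

width <- 2  # interval size (hypothetical name)

# Group by the start of each interval and sum every "Col" column per group
res <- dt[, lapply(.SD, sum),
          by = .(Time = width * floor(time / width)),
          .SDcols = grep("^Col", names(dt), value = TRUE)]
print(res)
```

Changing width is then the only thing needed to regroup by a different interval size.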

edited Aug 22 '13 at 18:23

answered Aug 22 '13 at 18:17


you don't need to create a separate variable: `data[, lapply(.SD, sum), by = floor(time / 2)]` – eddi Aug 22 '13 at 18:24
well done you beat me to it this is definitely the way to go – statquant Aug 22 '13 at 18:27
It's always nice to forego creating unnecessary variables, thanks @eddi! – Andreas Aug 22 '13 at 18:28

score 2 · Answer 2 · answered Aug 22 '13 at 20:25

Just to demonstrate that Ferdinand.Kraft's 'duplicate' call is correct, here is an `aggregate` version. It is arguably closer to what was requested, since the request included seeing the intervals in the original units.

> aggregate(data[-1], list(cut(data$Time, include.lowest=TRUE, 
                            right=FALSE, breaks=seq(range(data$Time)[1], 
                                                  range(data$Time)[2]+1, 
                                                  by=2))) ,
                      sum)

  Group.1 ColA ColB ColC
1   [0,2)    4   17   20
2   [2,4)    3   12   14
3   [4,6)    4    5    6
4   [6,8]   10   23    4
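A shorter variant of the same idea, as a rough sketch: label each row with the start of its interval using floor, then hand that label to `aggregate`. The names df, width and Start are invented for this example. One possible reason to prefer it is that the labels do not depend on the range of the data, whereas with the `cut` call above a largest Time that is even would, as far as I can tell, land in the closed final interval together with the previous group.

```r
# Example data from the question, as a plain data frame
df <- data.frame(
  Time = c(0:4, 7),
  ColA = c(1, 3, 0, 3, 4, 10),
  ColB = c(10, 7, 8, 4, 5, 23),
  ColC = c(5, 15, 9, 5, 6, 4)
)

width <- 2  # interval size (hypothetical name)

# Sum every column except Time within each interval;
# each group is labelled by the start of its interval
res <- aggregate(df[-1],
                 by = list(Start = width * floor(df$Time / width)),
                 FUN = sum)
print(res)
```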

score 0 · Answer 3 · answered Aug 22 '13 at 18:46

Just for posterity, this is the 'plyr' approach to solve the OP's question. The only real advantage of using 'plyr' functions over 'data.table' functions is that you can use non-data.table objects.

Setup: First, here is the data to use:

data <- read.table(text="
    Time ColA ColB ColC
    0    1    10   5
    1    3    7    15
    2    0    8    9
    3    3    4    5
    4    4    5    6
    7    10   23   4
    ", header=TRUE)

Ply-it: Here we are inputting a data frame (d) and outputting a data frame (d), so we'll use the 'ddply' function.

library(plyr)  # provides ddply() and .()

ddply(
    data[, -1],
    .(Time=floor(data$Time/2)),
    colSums)

  #   Time ColA ColB ColC
  # 1    0    4   17   20
  # 2    1    3   12   14
  # 3    2    4    5    6
  # 4    3   10   23    4

We are telling 'ddply' to use the variable 'data' for the data (minus the first column that contains the time), to index by floor(data$Time/2), and to create columns with the sums of the rest of the columns by running the 'colSums' function over each group of rows.
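For anyone curious what that split-apply-combine step amounts to, here is a rough base R sketch of the same three stages. It is not how 'plyr' is implemented internally, just an illustration; the names df, grp, pieces and sums are made up.

```r
# Example data from the question
df <- data.frame(
  Time = c(0:4, 7),
  ColA = c(1, 3, 0, 3, 4, 10),
  ColB = c(10, 7, 8, 4, 5, 23),
  ColC = c(5, 15, 9, 5, 6, 4)
)

# Split: one data frame per group, leaving out the Time column
grp <- floor(df$Time / 2)
pieces <- split(df[, -1], grp)

# Apply: column sums for each piece (a named numeric vector per group)
sums <- lapply(pieces, colSums)

# Combine: stack the vectors into rows and attach the group label
res <- data.frame(Time = as.numeric(names(pieces)),
                  do.call(rbind, sums),
                  row.names = NULL)
print(res)
```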
