I have data which looks like this:

Time ColA ColB ColC
0    1    10   5
1    3    7    15
2    0    8    9
3    3    4    5
4    4    5    6
7    10   23   4

I'd like to group my data into time intervals of equal size and sum the variables of each column. This, for instance, would be the result of grouping the time by 2:

Time ColA ColB ColC
0    4    17   20
2    3    12   14
4    4    5    6
7    10   23   4

I could label the data by introducing a new column whose value is floor(data$Time/2), but it's unclear how to do the summations. Most of the packages I've looked at seem to summarise only a single column, whereas I would like to summarise all the columns.

Richard
  • A search on the words in the title should have provided many answers. – IRTFM Aug 22 '13 at 18:57
  • Indeed, @DWin. It's unfortunate that in so many cases those answers are either too specific or lack explanation of the working parts. – Richard Aug 22 '13 at 20:11
  • Not a duplicate, @Ferdinand.kraft. That answer seems to combine multiple columns into one, whereas I would like my columns to stay separate. – Richard Aug 22 '13 at 20:13
  • If you want to apply the same function to all columns within groups, then `aggregate` is the base R method to use. – IRTFM Aug 23 '13 at 04:15
  • It may be so, @DWin, but the `data.table` method explained in Andreas' answer is so well-commented and cleanly-written that it's the one I'll be using. – Richard Aug 23 '13 at 15:11
  • Sorry, I didn't mean to imply that you should not use `data.table`. In fact, using data.table is probably a much better strategy in the long run. Some people are looking for a base R solution and aggregate fits the bill in this case. – IRTFM Aug 23 '13 at 15:23

3 Answers

Use the "data.table" package! The syntax is much easier, and the run time is faster.

### Load package
library(data.table)

### Set up variables; Create data.table
time <- c(0:4, 7)
ColA <- c(1, 3, 0, 3, 4, 10)
ColB <- c(10, 7, 8, 4, 5, 23)
ColC <- c(5, 15, 9, 5, 6, 4)
data <- data.table(time, ColA, ColB, ColC)

### Determine which columns we want to apply the function to
sum.cols <- grep("Col", names(data), value = TRUE)

### Sum each column within each group
data[, lapply(.SD, sum), by = floor(time / 2), .SDcols = sum.cols]

### Output:
    floor ColA ColB ColC
1:     0    4   17   20
2:     1    3   12   14
3:     2    4    5    6
4:     3   10   23    4

Note that the symbol ".SD" refers to the "Subset of Data" for each group, restricted here to the columns listed in ".SDcols". The lapply function iterates over the columns of that subset and applies "sum" to each one, so we get one sum per column for each value of the grouping expression, which data.table names "floor" in the output.
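If the group label should be in the original time units rather than a bin index, one option is to name the grouping expression and scale it back by the interval width. This is only a sketch built on the same example data; the column name "Time" and the variable "res" are chosen here for illustration.

```r
library(data.table)

# Same example data as above
data <- data.table(
  time = c(0:4, 7),
  ColA = c(1, 3, 0, 3, 4, 10),
  ColB = c(10, 7, 8, 4, 5, 23),
  ColC = c(5, 15, 9, 5, 6, 4)
)
sum.cols <- grep("Col", names(data), value = TRUE)

# Name the grouping expression and multiply the bin index by the
# interval width (2), so each group is labelled by the start of its interval
res <- data[, lapply(.SD, sum),
            by = .(Time = 2 * floor(time / 2)),
            .SDcols = sum.cols]
print(res)
```

The sums are the same as before; only the label changes, so the last group is labelled 6 (the start of the interval containing time 7).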

Andreas

Just to demonstrate that Ferdinand.Kraft's 'duplicate' call is correct, here is the `aggregate` approach. It is arguably closer to what was requested, since the request included seeing the intervals in the original units.

# Assumes 'data' is a data.frame with a Time column, as in the question
aggregate(data[-1],
          list(cut(data$Time, include.lowest=TRUE, right=FALSE,
                   breaks=seq(min(data$Time), max(data$Time)+1, by=2))),
          sum)

  Group.1 ColA ColB ColC
1   [0,2)    4   17   20
2   [2,4)    3   12   14
3   [4,6)    4    5    6
4   [6,8]   10   23    4
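To see where the interval labels come from, the `cut` step can be run on its own. The following is a small base R sketch using the Time values from the question; the names `breaks` and `bins` are introduced here for illustration.

```r
# Time values from the question
Time <- c(0, 1, 2, 3, 4, 7)

# Break points every 2 units, extended past the maximum so 7 falls in a bin
breaks <- seq(min(Time), max(Time) + 1, by = 2)

# right = FALSE makes the intervals left-closed: [a, b);
# include.lowest = TRUE closes the final interval on the right as well
bins <- cut(Time, breaks = breaks, right = FALSE, include.lowest = TRUE)

print(levels(bins))
print(table(bins))
```

Each row of the data is assigned one of these factor levels, and `aggregate` then sums every remaining column within each level.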
IRTFM

Just for posterity, this is the 'plyr' approach to the OP's question. The only real advantage of using 'plyr' functions over 'data.table' functions is that they also work on objects that are not data.tables.

Setup: First, here is the data to use:

data <- read.table(text="
    Time ColA ColB ColC
    0    1    10   5
    1    3    7    15
    2    0    8    9
    3    3    4    5
    4    4    5    6
    7    10   23   4
    ", header=TRUE)

Ply-it: Here we are inputting a data frame (d) and outputting a data frame (d), so we'll use the 'ddply' function.

library(plyr)

ddply(
    data[, -1], 
    .(Time=floor(data$Time/2)), 
    colSums)

  #   Time ColA ColB ColC
  # 1    0    4   17   20
  # 2    1    3   12   14
  # 3    2    4    5    6
  # 4    3   10   23    4

We are telling 'ddply' to use the variable 'data' for the data (minus the first column that contains the time), to index by floor(data$Time/2), and to create columns with the sums of the rest of the columns by running the 'colSums' function over each group of rows.
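For readers who want to see the split-apply-combine steps spelled out, here is a rough base R equivalent of the description above. It is a sketch for illustration only, not how 'plyr' is implemented; the names `groups`, `pieces` and `res` are made up for this example.

```r
# Same example data as in the setup above
data <- read.table(text="
    Time ColA ColB ColC
    0    1    10   5
    1    3    7    15
    2    0    8    9
    3    3    4    5
    4    4    5    6
    7    10   23   4
    ", header=TRUE)

# Split: one data frame per value of floor(Time/2), without the Time column
groups <- floor(data$Time / 2)
pieces <- split(data[, -1], groups)

# Apply: column sums within each piece
# Combine: bind the resulting named vectors into one matrix, one row per group
res <- do.call(rbind, lapply(pieces, colSums))
print(res)
```

The row names of `res` are the group indices (0 to 3), and the columns hold the same sums that 'ddply' returns.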

Dinre