21

Suppose I have a dataframe such that:

df<-data.frame(id=1:8,group=c(1,0,0,1,1,0,1,0),rep=c(rep("d1",4),rep("d2",4)),value=rbinom(8,1,0.6))
df
  id group rep value
1  1     1  d1     0
2  2     0  d1     0
3  3     0  d1     0
4  4     1  d1     1
5  5     1  d2     1
6  6     0  d2     0
7  7     1  d2     1
8  8     0  d2     1

What's the best way to get the cumulative sum by group and rep such that:

cumsum
group d1  d1+d2  d1+d2+d3
0     0     1      ...
1     1     3      ...
David Z
  • 6,641
  • 11
  • 50
  • 101
  • 2
    `cumsum` is a function. Type `?cumsum`. There is also a `by` function, which has documentation linking to similar functions (type `?by` and look under "See also")... so, `by(df$value,df$group,cumsum)` is one way – Frank Apr 11 '14 at 17:59
  • May you give an answer using the demo? +1 – David Z Apr 11 '14 at 18:04
  • 1
    To get it all in a data frame: `do.call(rbind, by(df$value,df$group,cumsum))` – josliber Apr 11 '14 at 18:05
  • Error in FUN(X[[1L]], ...) : could not find function "FUN" – David Z Apr 11 '14 at 18:07
  • The questions are even not similar if you look closely. – David Z Apr 11 '14 at 18:23
  • 1
    Oh weird -- my comment disappeared. Anyway, I was saying that @josilber meant `do.call(cbind,...` which will not give an error. Ah, I see that your question is quite unique. I went off the title instead of your desired output initially. – Frank Apr 11 '14 at 18:34

2 Answers2

17

I'd recommend working with the tidy form of the data. Here's an approach with dplyr, but it would be trivial to translate to data.table or base R.

First I'll create the dataset, setting the random seed to make the example reproducible:

set.seed(1014)
df <- data.frame(
  id = 1:8,
  group = c(1, 0, 0, 1, 1, 0, 1, 0),
  rep = c(rep("d1", 4), rep("d2", 4)),
  value = rbinom(8, 1, 0.6)
)
df

%>   id group rep value
%> 1  1     1  d1     1
%> 2  2     0  d1     0
%> 3  3     0  d1     0
%> 4  4     1  d1     1
%> 5  5     1  d2     1
%> 6  6     0  d2     1
%> 7  7     1  d2     1
%> 8  8     0  d2     1

Next, using dplyr, I'll first collapse to individual rows by group, and then compute the cumulative sum:

library(dplyr)

df <- df %>% 
  group_by(group, rep) %>%
  summarise(value = sum(value)) %>%
  mutate(csum = cumsum(value))
df

%> Source: local data frame [4 x 4]
%> Groups: group
%> 
%>   group rep value csum
%> 1     0  d1     0    0
%> 2     0  d2     2    2
%> 3     1  d1     2    2
%> 4     1  d2     2    4

For most cases, you're best of leaving the data in this form (it will be easier to work for), but you can reshape if you need to:

library(reshape2)

dcast(df, group ~ rep, value.var = "csum")

%>   group d1 d2
%> 1     0  0  2
%> 2     1  2  4
Giacomo
  • 1,796
  • 1
  • 24
  • 35
hadley
  • 102,019
  • 32
  • 183
  • 245
13
library(data.table)

# convert to data.table in place
setDT(df)

# dcast and do individual sums
dt.cast = dcast.data.table(df, group ~ rep, value.var = 'value',
                           fun.aggregate = sum)
dt.cast
#   group d1 d2
#1:     0  0  1
#2:     1  1  2

# cumsum
dt.cast[, as.list(cumsum(unlist(.SD))), by = group]
#   group d1 d2
#1:     0  0  1
#2:     1  1  3
eddi
  • 49,088
  • 6
  • 104
  • 155