1

I would like to compute cumulative sums of the columns of my data frame, one row at a time, conditional on another column that holds a binary variable.

So for each row, I would like the sum of that row and all rows above it in which the binary variable has the same value as in the current row.

Here is an example:

dummy var1  var2
1     x1     y1
0     x2     y2
0     x3     y3
1     x4     y4

My goal is to obtain this:

dummy var1     var2
1     x1       y1
0     x2       y2
0     x3+x2    y3+y2
1     x4+x1    y4+y1

I have asked this question previously for a simplified version (Adding columns sums in dataframe row wise) where I just add all of the values above without the condition. Is there a way to incorporate this condition?

Niccola Tartaglia
  • you can use `ave` function as mentioned here https://stackoverflow.com/questions/16850207/calculate-cumulative-sum-within-each-id-group – Rajat Mishra Feb 04 '18 at 02:04
  • That worked. Is there an elegant way to apply this to the entire dataframe? With ave I get this far: df$newVar <- ave(df$var1, df$dummy, FUN=cumsum) – Niccola Tartaglia Feb 04 '18 at 02:16
  • Could I use this within apply as the function of my choice? – Niccola Tartaglia Feb 04 '18 at 02:16
  • Corresponding row has the same value? Do you mean that if the dummy in row 5 is 1, then var1 would be x4+x1+x5 and var2 would be y4+y5+y1? Or do all the rows with the same dummy value get the same corresponding value? Those are two different algorithms. – exteral Feb 04 '18 at 02:29
  • Your `ave` answer looks perfect. How do you want to apply it? Are you worried about scaling up to more than two columns without writing them all out? – Gregor Thomas Feb 04 '18 at 02:33
  • Also agreed with exteralvictor that you need to clarify. Your description makes it sound like only consecutive rows of the same `dummy` value are summed, but your example contradicts that with the `x4 + x1` in the last row. – Gregor Thomas Feb 04 '18 at 02:34
  • ave is perfect, I just don't want to write it out for all the columns. – Niccola Tartaglia Feb 04 '18 at 02:59
  • The example is the correct one. I thought the text was not indicating any consecutiveness, but I changed it, hope it is clearer now. – Niccola Tartaglia Feb 04 '18 at 03:02
  • @ exteralvictor: your 1st example is correct, that is what I meant. If the dummy in row 5 is 1, I would want the values in row 5 to be x4+x1+x5 for var1 and y4+y5+y1 for var2, as you said. – Niccola Tartaglia Feb 04 '18 at 03:27
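
Following up on the comments above, here is a minimal base-R sketch of applying `ave` to every column except `dummy` without writing each one out. The data frame `df` and its values are made up for illustration; only `ave`, `cumsum`, `lapply` and `setdiff` from base R are used.

```r
# Hypothetical example data; the values are illustrative only
df <- data.frame(dummy = c(1, 0, 0, 1),
                 var1  = c(1, 2, 10, 20),
                 var2  = c(10, 20, 50, 3))

# Every column except the grouping column
cols <- setdiff(names(df), "dummy")

# Replace each column by its cumulative sum within each dummy group
df[cols] <- lapply(df[cols], function(x) ave(x, df$dummy, FUN = cumsum))
df
```

Because `cols` is derived from `names(df)`, the same two lines should work unchanged for 50 or more columns, provided they are all numeric.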

4 Answers

4

data.table::rleid will give you the grouping you want. If you convert your data frame to a data.table, it's like this:

(Note: this assumes that your text is accurate and your example incorrect: it groups by consecutive equal values in the dummy column.)

library(data.table)
setDT(your_data)
your_data[, id := rleid(dummy)][
  , c("var1", "var2") := .(cumsum(var1), cumsum(var2)), by = id
]

If you need to do this to a bunch of columns, set the id as above, define your vector of columns, and then:

cols = c("var1", "var2", "var3", ...)
your_data[, (cols) := lapply(.SD, cumsum), by = id, .SDcols = cols]

If you just want to group by the dummy column, ignoring consecutiveness, then your question is an exact duplicate of this one, and you can do it like this:

setDT(your_data)
your_data[, c("var1", "var2") := .(cumsum(var1), cumsum(var2)), by = dummy]
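
To see how the two groupings differ, here is a small base-R sketch. It builds run ids with `rle`, which for a single vector should match what `rleid` returns, and compares the result with grouping by `dummy` directly. The vectors and the names `run_id`, `by_run` and `by_dummy` are made up for illustration.

```r
# Hypothetical data with a repeated run at the end, for illustration only
dummy <- c(1, 0, 0, 1, 1)
var1  <- c(1, 2, 10, 20, 5)

# Run ids: a new id starts whenever the value of dummy changes
r <- rle(dummy)
run_id <- rep(seq_along(r$lengths), r$lengths)

by_run   <- ave(var1, run_id, FUN = cumsum)  # restarts at every new run
by_dummy <- ave(var1, dummy,  FUN = cumsum)  # carries over across runs

data.frame(dummy, var1, run_id, by_run, by_dummy)
```

In row 4 the run-based sum restarts at 20, while the dummy-based sum continues from row 1 and gives 21. The latter is the `x4+x1` behaviour shown in the question's example.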
Gregor Thomas
  • Good answer, I rarely use data.table, figure I should get familiar with that. – exteral Feb 04 '18 at 02:42
  • Is it possible to use the lower group by dummy columns without writing all the column names (I have more than 50 columns). – Niccola Tartaglia Feb 04 '18 at 03:03
  • So your lower alternative (the group by the dummy) works perfectly, but I cannot get this to work when passing a vector of columns. For instance, if I define: cols=c("var1", "var2") and then run your_data[, cols := .(cumsum(var1), cumsum(var2)), by = dummy] The data remains unchanged. – Niccola Tartaglia Feb 04 '18 at 15:51
  • Oh, I just saw in your post that I forgot the brackets. Nevermind. – Niccola Tartaglia Feb 04 '18 at 16:02
3

You can use Reduce:

# Build each running "sum" as a string, newest term first (e.g. "x4+x1")
fun <- function(x) Reduce(function(x, y) paste0(y, "+", x), x, accumulate = TRUE)
sapply(dat[-1], function(x) ave(x, dat[, 1], FUN = fun))
     var1    var2   
[1,] "x1"    "y1"   
[2,] "x2"    "y2"   
[3,] "x3+x2" "y3+y2"
[4,] "x4+x1" "y4+y1"

If these were numeric values rather than symbols, you could do:

#Example data
dat2=data.frame(dummy=dat[,1],var1=c(1,2,10,20),var2=c(10,20,50,3))

What to use:

sapply(dat2[-1],function(x)ave(x,dat2[,1],FUN=cumsum))
     var1 var2
[1,]    1   10
[2,]    2   20
[3,]   12   70
[4,]   21   13
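
As a rough illustration of why both snippets work: `Reduce` with `accumulate = TRUE` keeps every intermediate result, so with `+` it reproduces `cumsum`. The sketch below also shows one way to get a data frame back, since `sapply` returns a matrix. The values mirror the example data above; `running` and `res` are names chosen here for illustration.

```r
# Illustrative values only
x <- c(1, 2, 10, 20)

# Keeping every intermediate result of `+` gives the cumulative sum
running <- Reduce(`+`, x, accumulate = TRUE)

# sapply returns a matrix; cbind with the dummy column restores a data frame
dat2 <- data.frame(dummy = c(1, 0, 0, 1), var1 = x, var2 = c(10, 20, 50, 3))
res <- cbind(dat2[1],
             sapply(dat2[-1], function(v) ave(v, dat2$dummy, FUN = cumsum)))
res
```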
Onyambu
2

Some good answers here already. This is a solution using dplyr:

library(dplyr)

# funs() is deprecated; pass cumsum directly (dplyr >= 1.0 also offers across())
data.frame(dummy = c(1L,0L,0L,1L), var1 = c(1L,2L,4L,6L), var2 = c(100L,20L,30L,400L)) %>%
    group_by(dummy) %>%
    mutate_all(cumsum)

# A tibble: 4 x 3
# Groups:   dummy [2]
  dummy  var1  var2
  <dbl> <dbl> <dbl>
1  1.00  1.00 100  
2  0     2.00  20.0
3  0     6.00  50.0
4  1.00  7.00 500  
gatsky
1

Well I don't think you could do this using a simple function, at least not from my experience. So I suggest writing a function as follows:

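sum_new_df <- function(df){
    # Start from the original values, without the dummy column
    new_df <- df[, -1, drop = FALSE]
    for (i in seq_len(nrow(df) - 1)){
        for (j in (i+1):nrow(df)){
            if (df$dummy[i] == df$dummy[j]){
                # Add original row i to every later row with the same dummy
                new_df[j,] <- new_df[j,] + df[i, -1, drop = FALSE]
            }
        }
    }
    new_df
}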

This function sums the row values that share the same dummy in order of increasing row number. So for a large data.frame, the values build up like a pyramid.

exteral