0

I am adding the sums of the columns of a dataframe one row at a time, conditional on another column that has a binary variable.

So for each row, I compute the sum of the entire column above it for all rows where the binary variable in the corresponding row has the same value.

I would like to reverse this so that each row has the sum of the "wrong" group, that is the value for the incorrect dummy value (as part of a robustness test):

Here is an example:

dummy var1  var2
1     x1     y1
0     x2     y2
0     x3     y3
1     x4     y4

Currently I obtain this:

dummy var1     var2
1     x1       y1
0     x2       y2
0     x3+x2    y3+y2
1     x4+x1    y4+y1

I would like to get the wrong values, that is (first row being N/A since there is no value for dummy=0 in that row or above):

dummy var1     var2
1     N/A      N/A
0     x1       y1
0     x1       y1
1     x3+x2    y3+y2

This how I did it so far (I asked this in this forum Adding columns sums in dataframe row wise conditional on a dummy) for the "correct" version:

setDT(df1)
cols = c("var1", "var2", "var3", ...)
df1[, (cols) := lapply(.SD, cummean) , by = dummy, .SD = cols]

I was thinking of just using: -dummy instead of dummy, but that does merely change the label of each group. Is there an easy way to change this?

chinsoon12
  • 25,005
  • 4
  • 25
  • 35
Niccola Tartaglia
  • 1,537
  • 2
  • 26
  • 40

1 Answers1

1

You can take the difference between the cumulative sum of your variables (disregarding the dummy id) and the result that you have so far.

setDT(df1)
cols <- paste0("var", 1:3)

#cumulative sum of each column
df1[, (paste0("cumsum_", cols)) := lapply(.SD, cumsum), .SD=cols]

#your cumulative sum by ID
df1[, (paste0("sumById_", cols)) := lapply(.SD, cumsum), by=dummy, .SD=cols]

#the reverse version, the solution is to think of using subtracting the original cumulative sum by id from the cumulative sum of everything.
rcsCols <- paste0("reverseCumSum_", cols)
df1[, (rcsCols) := lapply(cols, 
    function(x) get(paste0("cumsum_", x)) - get(paste0("sumById_", x)))] 

#then set NA before the first change in dummy
df1[, (rcsCols) := lapply(.SD, function(x) {
    x[seq_along(rle(dummy)$lengths[1])] <- NA
    x
}), .SDcols=rcsCols]

df1    
#   dummy var1 var2 var3 cumsum_var1 cumsum_var2 cumsum_var3 sumById_var1 sumById_var2 sumById_var3 reverseCumSum_var1 reverseCumSum_var2 reverseCumSum_var3
#1:     1    1    5   10           1           5          10            1            5           10                 NA                 NA                 NA
#2:     0    2    6   11           3          11          21            2            6           11                  1                  5                 10
#3:     0    3    7   12           6          18          33            5           13           23                  1                  5                 10
#4:     1    4    8   13          10          26          46            5           13           23                  5                 13                 23

data:

df1 <- data.table(dummy=c(1,0,0,1),
    var1=1:4,
    var2=5:8,
    var3=10:13)
chinsoon12
  • 25,005
  • 4
  • 25
  • 35