Conditional running count (cumulative sum) with reset in R (dplyr)

Question

I'm trying to calculate a running count (i.e., cumulative sum) that is conditional on other variables and that can reset for particular values on another variable. I'm working in R and would prefer a dplyr-based solution, if possible.

I'd like to create a variable for the running count, cumulative, based on the following algorithm:

Calculate the running count (cumulative) within combinations of id and age
Increment running count (cumulative) by 1 for every subsequent trial where accuracy = 0, block = 2, and condition = 1
Reset running count (cumulative) to 0 for each trial where accuracy = 1, block = 2, and condition = 1, and the next increment resumes at 1 (not the previous number)
For each trial where block != 2, or condition != 1, leave the running count (cumulative) as NA

Here's a minimal working example:

mydata <- data.frame(id = c(1,1,1,1,1,1,1,1,1,1,1),
                 age = c(1,1,1,1,1,1,1,1,1,1,2),
                 block = c(1,1,2,2,2,2,2,2,2,2,2),
                 trial = c(1,2,1,2,3,4,5,6,7,8,1),
                 condition = c(1,1,1,1,1,2,1,1,1,1,1),
                 accuracy = c(0,0,0,0,0,0,0,1,0,0,0)
)

id  age block   trial   condition   accuracy
1   1   1       1       1           0
1   1   1       2       1           0
1   1   2       1       1           0
1   1   2       2       1           0
1   1   2       3       1           0
1   1   2       4       2           0
1   1   2       5       1           0
1   1   2       6       1           1
1   1   2       7       1           0
1   1   2       8       1           0
1   2   2       1       1           0

The expected output is:

id  age block   trial   condition   accuracy    cumulative
1   1   1       1       1           0           NA
1   1   1       2       1           0           NA
1   1   2       1       1           0           1
1   1   2       2       1           0           2
1   1   2       3       1           0           3
1   1   2       4       2           0           NA
1   1   2       5       1           0           4
1   1   2       6       1           1           0
1   1   2       7       1           0           1
1   1   2       8       1           0           2
1   2   2       1       1           0           1

akrun · Answer 1 · 2018-10-24T04:08:10.893

Here is an option using data.table. Create a binary column based on matching the pasted values of 'accuracy', 'block', 'condition' with that of the custom values, grouped by run-length-id of the binary column ('ind'), 'id' and 'age', get the cumulative sum of 'ind' and assign (:=) it to a new column ('Cumulative')

library(data.table)
setDT(mydata)[, ind := match(do.call(paste0, .SD), c("121", "021")) - 1,
    .SDcols = c("accuracy", "block", "condition")
     ][, Cumulative := cumsum(ind), .(rleid(ind), id, age)
      ][, ind := NULL][]
#    id age block trial condition accuracy Cumulative
# 1:  1   1     1     1         1        0         NA
# 2:  1   1     1     2         1        0         NA
# 3:  1   1     2     1         1        0          1
# 4:  1   1     2     2         1        0          2
# 5:  1   1     2     3         1        0          3
# 6:  1   1     2     4         2        0         NA
# 7:  1   1     2     5         1        1          0
# 8:  1   1     2     6         1        0          1
# 9:  1   1     2     7         1        0          2
#10:  1   2     2     1         1        0          1

This answer is interesting, but it doesn't fully work with my data. I added a row to the data (see Trial 5). — itpetersen, Oct 26 '18 at 02:29

Ronak Shah · Accepted Answer · 2018-10-29T04:46:54.757

We can use case_when to assign the value which we need based on our conditions. We then add an additional group_by condition using cumsum to switch values when the temp column 0. In the final mutate step we temporarily replace NA values in temp to 0, then take cumsum over it and put back the NA values again to it's place to get the final output.

library(dplyr)

mydata %>%
    group_by(id, age) %>%
    mutate(temp = case_when(accuracy == 0 & block == 2 & condition == 1 ~ 1, 
                            accuracy == 1 & block == 2 & condition == 1 ~ 0, 
                            TRUE ~ NA_real_)) %>%
    ungroup() %>%
    group_by(id, age, group = cumsum(replace(temp == 0, is.na(temp), 0))) %>%
    mutate(cumulative = replace(cumsum(replace(temp, is.na(temp), 0)),
                          is.na(temp), NA)) %>%
    select(-temp, -group)


#    group    id   age block trial condition accuracy cumulative
#   <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>      <dbl>
# 1     0     1     1     1     1         1        0         NA
# 2     0     1     1     1     2         1        0         NA
# 3     0     1     1     2     1         1        0          1
# 4     0     1     1     2     2         1        0          2
# 5     0     1     1     2     3         1        0          3
# 6     0     1     1     2     4         2        0         NA
# 7     0     1     1     2     5         1        0          4
# 8     1     1     1     2     6         1        1          0
# 9     1     1     1     2     7         1        0          1
#10     1     1     1     2     8         1        0          2
#11     1     1     2     2     1         1        0          1

Conditional running count (cumulative sum) with reset in R (dplyr)

2 Answers2

Linked