
I am working with the R programming language. I have the following dataset that represents different students taking an exam across a period of dates - each student can either "pass" (1) or "fail" (0) the exam:

# Load the data.table package
library(data.table)

# Generate some sample data
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)

# Create a data table from the sample data
my_data = data.table(id, results, date_exam_taken)
my_data <- my_data[order(id, date_exam_taken)]

# Generate some additional columns for each record
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL

   id results date_exam_taken exam_number
1:  1       0      2002-10-06           1
2:  1       1      2003-07-21           2
3:  1       1      2003-10-15           3
4:  1       0      2005-07-21           4
5:  1       1      2014-08-22           5
6:  1       1      2015-09-11           6

Grouped over students, I want to create a variable ("consec") that counts the number of consecutive passed exams - and this variable should reset to 0 at each failed exam.

Based on the answer provided here (Create counter within consecutive runs of certain values), I think I can do this for the entire dataset at once:

my_data$consec = unlist(lapply(
    split(my_data$results, c(0, cumsum(abs(diff(!my_data$results == 1))))),
    function(x) (x[1] == 1) * seq(length(x))
))

However, this is counting consecutive passed exams without taking into consideration individual students - each time a new student appears, I would like the "consec" variable to restart.
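A tiny made-up example (two students, six exams - not part of the original data) illustrates the issue: with the ungrouped formula, the run count carries over from student 1 into student 2.

```r
# Illustrating the issue: the ungrouped formula runs across student boundaries
ids     <- c(1, 1, 1, 1, 2, 2)
results <- c(1, 1, 0, 1, 1, 0)

consec <- unlist(lapply(
  split(results, c(0, cumsum(abs(diff(!results == 1))))),
  function(x) (x[1] == 1) * seq(length(x))
))

print(consec)  # 1 2 0 1 2 0 -- student 2's first pass is counted as 2
```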

Can someone please show me how to do this?
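For what it's worth, one possible way to restart the count per student is data.table's own grouping combined with its `rleid()` and `rowid()` helpers - a sketch on toy data, not taken from the linked post:

```r
library(data.table)

# Toy data: two students with pass/fail sequences
my_data <- data.table(
  id      = c(1, 1, 1, 1, 2, 2, 2),
  results = c(1, 1, 0, 1, 1, 0, 1)
)

# Within each id, rleid(results) labels runs of identical results,
# rowid() numbers the rows within each run, and multiplying by
# results zeroes out the fail rows.
my_data[, consec := rowid(rleid(results)) * results, by = id]

print(my_data$consec)  # 1 2 0 1 1 0 1
```

Because the computation happens inside `by = id`, the run labels never span two students, so the count restarts for each new id.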

Mark Rotteveel
stats_noob
First, there's no need for 100000 rows and several columns to demonstrate your issue. Please learn how to create a _minimal_ example, with desired output. You could have used e.g. the values in the post you linked to, doubled them and added a grouping variable. Or even smaller, like `d = data.frame(g = rep(c("a", "b"), 6:5), v = c(1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1))`. Then used e.g. the top-voted answer in the link, in your favorite 'by-group' tool - you are clearly familiar with `ave` since you use it in the question: `d$i = ave(d$v, d$g, FUN = function(x){ tmp<-cumsum(x);tmp-cummax((!x)*tmp)})`. – Henrik Jan 21 '23 at 10:55
  • @ Henrik: Thank you for your suggestions - I will try to keep these in mind for the future. – stats_noob Jan 21 '23 at 17:03
  • Just to clarify - the answer tailored for my question would look something like this? – stats_noob Jan 21 '23 at 17:04
  • my_data$i = ave(my_data$results, my_data$id, FUN = function(x){ tmp<-cumsum(x);tmp-cummax((!x)*tmp)}) – stats_noob Jan 21 '23 at 17:04
  • Again, it's easy to verify your code yourself if you create a _minimal_ representative sample of your data ;) Good luck! – Henrik Jan 21 '23 at 17:15
  • @ Henrik: thank you for your reply! I am always paranoid - perhaps my code is correct for some minimal representative sample - but perhaps there might be some exception cases where the code will not be correct. – stats_noob Jan 21 '23 at 18:01
  • 2
    Indeed, I very much recognize that feeling :) I strongly believe that the whole process of actively creating tiny toy data that captures the complexity of your real data, not more and not less, will get you much closer to solve a problem (and not the least, make it easier for potential helpers to try out and verify their code). By "only" sampling 100000 numbers (no offense ;), most of the cases will probably be redundant, and it doesn't really force you to think, in advance, of the cases that your code needs to address. Good luck with your coding (and minimal examples :) Cheers – Henrik Jan 21 '23 at 18:13
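For reference, Henrik's minimal example and `ave()` call from the comments can be run as-is to verify that the counter resets both at each fail and at each new group:

```r
# Minimal example from the comments: two groups "a" and "b"
d <- data.frame(g = rep(c("a", "b"), 6:5),
                v = c(1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1))

# Within each group g: cumulative sum of passes, reset to 0 after each fail.
# cummax((!x) * tmp) tracks the running total as of the most recent fail,
# so subtracting it restarts the count.
d$i <- ave(d$v, d$g, FUN = function(x) { tmp <- cumsum(x); tmp - cummax((!x) * tmp) })

print(d$i)  # 1 2 0 1 2 3 1 2 0 1 2
```

Note that the count restarts at 1 for the first row of group "b", which is exactly the per-student behaviour asked for in the question.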

0 Answers