
I am working with the R programming language. I have the following dataset that represents different students taking an exam across a period of dates - each student can either "pass" (1) or "fail" (0) the exam:

# Load the data.table package
library(data.table)

# Generate some sample data
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)

# Create a data table from the sample data
my_data = data.table(id, results, date_exam_taken)
my_data <- my_data[order(id, date_exam_taken)]

# Generate some additional columns for each record
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL

   id results date_exam_taken exam_number
1:  1       0      2002-10-06           1
2:  1       1      2003-07-21           2
3:  1       1      2003-10-15           3
4:  1       0      2005-07-21           4
5:  1       1      2014-08-22           5
6:  1       1      2015-09-11           6

Grouped over students, I want to create a variable ("consec") that counts the number of consecutive passed exams - and this variable should reset to 0 at each failed exam.

Based on the answer provided here (Create counter within consecutive runs of certain values), I think I can do this for the entire dataset at once:

my_data$consec = unlist(lapply(
    split(my_data$results, c(0, cumsum(abs(diff(!my_data$results == 1))))),
    function(x) (x[1] == 1) * seq(length(x))
))

However, this is counting consecutive passed exams without taking into consideration individual students - each time a new student appears, I would like the "consec" variable to restart.
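A tiny made-up example (two students, six exams - not part of the original data) illustrates the issue: with the ungrouped formula, the run count carries over from student 1 into student 2.

```r
# Illustrating the issue: the ungrouped formula runs across student boundaries
ids     <- c(1, 1, 1, 1, 2, 2)
results <- c(1, 1, 0, 1, 1, 0)

consec <- unlist(lapply(
  split(results, c(0, cumsum(abs(diff(!results == 1))))),
  function(x) (x[1] == 1) * seq(length(x))
))

print(consec)  # 1 2 0 1 2 0 -- student 2's first pass is counted as 2
```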

Can someone please show me how to do this?
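For what it's worth, one possible way to restart the count per student is data.table's own grouping combined with its `rleid()` and `rowid()` helpers - a sketch on toy data, not taken from the linked post:

```r
library(data.table)

# Toy data: two students with pass/fail sequences
my_data <- data.table(
  id      = c(1, 1, 1, 1, 2, 2, 2),
  results = c(1, 1, 0, 1, 1, 0, 1)
)

# Within each id, rleid(results) labels runs of identical results,
# rowid() numbers the rows within each run, and multiplying by
# results zeroes out the fail rows.
my_data[, consec := rowid(rleid(results)) * results, by = id]

print(my_data$consec)  # 1 2 0 1 1 0 1
```

Because the computation happens inside `by = id`, the run labels never span two students, so the count restarts for each new id.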

Mark Rotteveel
stats_noob
First, there's no need for 100000 rows and several columns to demonstrate your issue. Please learn how to create a _minimal_ example, with desired output. You could have used e.g. the values in the post you linked to, doubled them and added a grouping variable. Or even smaller, like `d = data.frame(g = rep(c("a", "b"), 6:5), v = c(1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1))`. Then used e.g. the top-voted answer in the link, in your favorite 'by-group' tool - you are clearly familiar with `ave` since you use it in the question: `d$i = ave(d$v, d$g, FUN = function(x){ tmp<-cumsum(x);tmp-cummax((!x)*tmp)})`. – Henrik Jan 21 '23 at 10:55
  • @ Henrik: Thank you for your suggestions - I will try to keep these in mind for the future. – stats_noob Jan 21 '23 at 17:03
  • Just to clarify - the answer tailored for my question would look something like this? – stats_noob Jan 21 '23 at 17:04
  • my_data$i = ave(my_data$results, my_data$id, FUN = function(x){ tmp<-cumsum(x);tmp-cummax((!x)*tmp)}) – stats_noob Jan 21 '23 at 17:04
  • Again, it's easy to verify your code yourself if you create a _minimal_ representative sample of your data ;) Good luck! – Henrik Jan 21 '23 at 17:15
  • @ Henrik: thank you for your reply! I am always paranoid - perhaps my code is correct for some minimal representative sample - but perhaps there might be some exception cases where the code will not be correct. – stats_noob Jan 21 '23 at 18:01
  • 2
    Indeed, I very much recognize that feeling :) I strongly believe that the whole process of actively creating tiny toy data that captures the complexity of your real data, not more and not less, will get you much closer to solve a problem (and not the least, make it easier for potential helpers to try out and verify their code). By "only" sampling 100000 numbers (no offense ;), most of the cases will probably be redundant, and it doesn't really force you to think, in advance, of the cases that your code needs to address. Good luck with your coding (and minimal examples :) Cheers – Henrik Jan 21 '23 at 18:13
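For reference, Henrik's minimal example and `ave()` call from the comments can be run as-is to verify that the counter resets both at each fail and at each new group:

```r
# Minimal example from the comments: two groups "a" and "b"
d <- data.frame(g = rep(c("a", "b"), 6:5),
                v = c(1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1))

# Within each group g: cumulative sum of passes, reset to 0 after each fail.
# cummax((!x) * tmp) tracks the running total as of the most recent fail,
# so subtracting it restarts the count.
d$i <- ave(d$v, d$g, FUN = function(x) { tmp <- cumsum(x); tmp - cummax((!x) * tmp) })

print(d$i)  # 1 2 0 1 2 3 1 2 0 1 2
```

Note that the count restarts at 1 for the first row of group "b", which is exactly the per-student behaviour asked for in the question.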

0 Answers