I have a large data frame (200k rows) of monthly trial data. Each monthly variable records the result of the trial in that month: positive (1) or negative (0). The file also contains unique IDs and a number of factor variables for use in analysis. Here is a simplified example for illustration:
w <- c(101, 0, 0, 0, 1, 1, 1, 5)
x <- c(102, 0, 0, 0, 0, 0, 0, 3)
y <- c(103, 1, 0, 0, 0, 0, 0, 2)
z <- c(104, 1, 1, 1, 0, 0, 0, 2)
dfrm <- data.frame(rbind(w,x,y,z), row.names = NULL)
names(dfrm) <- c("id","jan","feb","mar","apr","may","jun","start")
The trial participants all joined at different times; the final column is an index giving the column in which that participant's first trial result is recorded. Results for months prior to the participant joining are recorded as zeros (as in the first row of the example).
I want to identify the first sequence of three consecutive zeros per participant and return the position at which that 3-zero sequence starts, but searching only the columns since they started the trial (those from the index column onwards).
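To make the target concrete, here is a minimal sketch of the rule for a single participant. The function name `first_zero_run` and its arguments are hypothetical, and it assumes the position is counted from the participant's first active month (so 1 means the run begins in the month they joined).

```r
# Hypothetical helper (not from the post): position, counted from the
# participant's first active month, of the first run of `run_len` zeros.
first_zero_run <- function(results, first_active, run_len = 3) {
  active <- results[first_active:length(results)]
  if (length(active) < run_len) return(NA_integer_)
  # embed() puts each window of run_len consecutive values in one row
  windows <- embed(active, run_len)
  hit <- which(rowSums(windows == 0) == run_len)
  if (length(hit) == 0) NA_integer_ else hit[1]
}

# Month values from the example rows; first_active is start - 1 because
# column 1 of dfrm holds the id.
res_w <- first_zero_run(c(0, 0, 0, 1, 1, 1), first_active = 4)  # row w
res_z <- first_zero_run(c(1, 1, 1, 0, 0, 0), first_active = 1)  # row z
```

Under that assumption, row w has no qualifying run (NA) and row z's run begins at position 4.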
My approach - and I'm sure there are many - has been to split this into two tasks: writing NAs to those test results that occurred before the participant joined, using a for loop:
for (i in seq_len(nrow(dfrm))) {
  # start == 2 means the participant joined in the first month: nothing to blank
  if (dfrm$start[i] > 2) {
    dfrm[i, 2:(dfrm$start[i] - 1)] <- NA
  }
}
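The same masking can probably be done without a row loop. The sketch below assumes the month columns are always columns 2 to 7, as in the example, and compares each cell's column number with that row's `start` value in one step. The matrix name `m` is just an illustrative choice.

```r
# Rebuild the example data so the sketch runs on its own
w <- c(101, 0, 0, 0, 1, 1, 1, 5)
x <- c(102, 0, 0, 0, 0, 0, 0, 3)
y <- c(103, 1, 0, 0, 0, 0, 0, 2)
z <- c(104, 1, 1, 1, 0, 0, 0, 2)
dfrm <- data.frame(rbind(w, x, y, z), row.names = NULL)
names(dfrm) <- c("id", "jan", "feb", "mar", "apr", "may", "jun", "start")

m <- as.matrix(dfrm[, 2:7])
# col(m) is the column number inside m; adding 1 gives the column number in
# dfrm. dfrm$start is recycled down the columns, so each row is compared
# with its own start value.
m[col(m) + 1 < dfrm$start] <- NA
dfrm[, 2:7] <- m
```

Working on a plain matrix rather than assigning into the data frame row by row is usually where most of the time is saved.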
before using a match loop on the full range of data now that the rogue early zeros have been set to NA:
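for (i in seq_len(nrow(dfrm))) {
  # f holds three identical values: the position of the first 0 in columns 2:7
  f <- match(c(0, 0, 0), dfrm[i, 2:7])
  dfrm$outputmth[i] <- f[1]
}
# convert the position within columns 2:7 to a position relative to the start column
dfrm$outputmth <- dfrm$outputmth - (dfrm$start - 2)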
This is successful (I think) in generating my desired output: the first occurrence of three successive zeros per participant while active, and NA where no occurrence was found.
This involved some clunky workarounds; in particular, the second loop returns a vector of 3 values in f, from which I have to select only the first item to populate dfrm$outputmth.
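One thing worth checking here: `match()` looks up each element of its first argument separately, so it reports the first single zero rather than the first run of three. A small illustration on a made-up row (not from the example data) where the two differ:

```r
# Made-up row: an isolated zero in position 2, then a real run starting at 4
row_vals <- c(1, 0, 1, 0, 0, 0)

# Each of the three zeros in the pattern is matched independently, so all
# three results point at the first zero in row_vals.
f <- match(c(0, 0, 0), row_vals)
```

Here `f` is 2 for all three entries, although the first run of three zeros begins at position 4. The example data frame happens not to contain such a row, which may be why the output looks right.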
But more importantly, running this code on the full data set has taken around 30 minutes to execute. So, feeling a little embarrassed, I'm hoping there is at least one more efficient way to write and run this?
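For comparison, here is one possible loop-free sketch, not a definitive solution. It assumes the month columns are columns 2 to 7 with no NAs in them, and that the result should be counted from the participant's start month. It tests every 3-month window at once by shifting the matrix against itself, then discards windows that begin before the participant joined.

```r
# Rebuild the example data so the sketch runs on its own
w <- c(101, 0, 0, 0, 1, 1, 1, 5)
x <- c(102, 0, 0, 0, 0, 0, 0, 3)
y <- c(103, 1, 0, 0, 0, 0, 0, 2)
z <- c(104, 1, 1, 1, 0, 0, 0, 2)
dfrm <- data.frame(rbind(w, x, y, z), row.names = NULL)
names(dfrm) <- c("id", "jan", "feb", "mar", "apr", "may", "jun", "start")

is_zero <- as.matrix(dfrm[, 2:7]) == 0
k <- ncol(is_zero)

# hit[i, j] is TRUE when months j, j + 1 and j + 2 are all zero for row i
hit <- is_zero[, 1:(k - 2), drop = FALSE] &
  is_zero[, 2:(k - 1), drop = FALSE] &
  is_zero[, 3:k, drop = FALSE]

# Discard windows that begin before the participant's start column
# (month j sits in column j + 1 of dfrm)
hit[col(hit) + 1 < dfrm$start] <- FALSE

# First qualifying window per row, or NA when the row has none
first <- max.col(hit + 0, ties.method = "first")
first[rowSums(hit) == 0] <- NA

# Express the position relative to the start month, as in the loop version
dfrm$outputmth <- first - (dfrm$start - 2)
```

On the four example rows this gives NA, 1, 2 and 4. Because every step operates on whole matrices, it should scale to 200k rows far better than row-wise loops, though I have not timed it on data of that size.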
Many thanks for any assistance.