0

I have a large data frame (200k rows) consisting of monthly trial data. Each variable records the result of the trial in that month; positive (1) or negative (0). The file also contains unique ids and a number of factor variables for use in analysis. Here is a simplified example for illustration:

w <- c(101, 0, 0, 0, 1, 1, 1, 5)
x <- c(102, 0, 0, 0, 0, 0, 0, 3)
y <- c(103, 1, 0, 0, 0, 0, 0, 2)
z <- c(104, 1, 1, 1, 0, 0, 0, 2)
dfrm <- data.frame(rbind(w,x,y,z), row.names = NULL)
names(dfrm) <- c("id","jan","feb","mar","apr","may","jun","start")

The trial participants all joined at different times; the final column is an index giving the column in which that participant's first trial result is recorded. Results for months prior to the participant joining are recorded as zeros (as in the first row of the example).

I want to identify the first sequence of three consecutive zeros per participant, and then return the position of the start of that 3-zero sequence; but limiting my search only to the columns since they started the trial (those from the index column onwards).

My approach - and I'm sure there are many - has been to split this into two tasks: writing NAs to those test results that occurred before the participant joined, using a for loop:

for (i in 1:nrow(dfrm)){
if(dfrm$start[i] > 2) 
dfrm[i,2:(dfrm$start[i]-1)] <- NA
}

before using a match loop on the full range of data now that the rogue early zeros have been set to NA:

for (i in 1:nrow(dfrm)){
f <- match(c(0,0,0), dfrm[i,2:7])
dfrm$outputmth[i] <- f[1]
}

dfrm$outputmth <- dfrm$outputmth - (dfrm$start - 2)

Which is successful (I think) in generating my desired output: the first occurrence of 3 successive zeros per participant when active, and NA where no occurrence was found.

This involved some clunky workarounds; in particular the second loop returning a list of 3 values in f from which I have to select only the first item to populate dfrm$outputmth. But more importantly, running this code on the full data set has taken around 30mins to execute. So, feeling a little embarassed, I'm hoping there is at least one more efficient way to write and run this?

Many thanks for any assistance.

Lance Lee
  • 3
  • 3
  • 1
    Have a look at run length encoding ('?rle'). – jlhoward Aug 14 '14 at 20:23
  • There is way too much text here. I personally can't read this without falling asleep. Just leave the data you priveded, erase all the text including the `for` loops and provide the desired output and some **short** explanation – David Arenburg Aug 14 '14 at 20:34
  • @DavidArenburg It could be more concise, but I mostly found the detail helpful, and the question well-written. It is clear that the author took the time to carefully reread and format the question properly. – nograpes Aug 14 '14 at 20:40
  • this answer is related and could be helpful: http://stackoverflow.com/questions/15150780/find-consecutive-sequence-of-zeros-in-r it also makes use of rle – bjoseph Aug 14 '14 at 20:41

1 Answers1

1

I don't think that what you have written already should give the correct result... Because match(c(0, 0, 0), ...) won't match the first three consecutive zeros but rather give the first match of zero repeated three times. In general you should try to avoid for loops that iterate over the rows of a data frame because they tend to be slow (e.g. if you are altering the contents of the data frame in the body of the loop this causes copies to be created). A workaround is using apply to go over the rows of the data frame and using the function rle to check whether there are three consecutive zeros

dfrm$outputmth <- apply(dfrm[-1], 1, function(x) {
    y <- rle(x[x[7]:6])
    z <- y$values == 0 & y$lengths >= 3
    i <- which(z)[1]
    if (is.na(i)) return(NA)
    if (i == 1) return(x[7])
    return(sum(y$lengths[1:(i-1)]) + x[7])
})

dfrm
#  id jan feb mar apr may jun start outputmth
# 101   0   0   0   1   1   1     5        NA
# 102   0   0   0   0   0   0     3         3
# 103   1   0   0   0   0   0     2         2
# 104   1   1   1   0   0   0     2         4
konvas
  • 14,126
  • 2
  • 40
  • 46
  • I think this is giving the position of the sequence rather than the column position - hence 1, 2 for rows 3 & 4 instead of the expected 2, 4. But `apply` and `rle` looks the way to go, many thanks. – Lance Lee Aug 14 '14 at 21:48
  • I see, I missed the fact that you wanted the column position instead. I have edited the post. I am not sure why you say that the column position should be 2 for row 3 though. I think it should be 3, i.e. the column corresponding to February, since this is when the individual started the trial. – konvas Aug 15 '14 at 08:10
  • Thank you - there are two reasons; I was treating `dfrm$start` as having already disregarded the first column, so `x <- x[x[8]:7]` would solve that issue in this case. – Lance Lee Aug 15 '14 at 18:13
  • The second is best explained with this example: `1 1 0 0 1 1 0 0 0` with `start == 1`. In this case my desired answer is 7 - the start of the 3 consecutive zeros. Since `rle` is counting sequences, `FALSE FALSE FALSE TRUE`, the result will be 4, and adding the value from `start` does not solve it. Somehow I think I need to relate the values in `lengths` to `values` - i.e. aggregate `lengths` for initial multiple returns of `FALSE` before the first `TRUE` in `values` - but can't see that `rle` allows this? – Lance Lee Aug 15 '14 at 18:18
  • You're right I see where it can go wrong now.. Have a look at the updated answer and let me know if it works for you. You can just sum up all the lengths of `rle` up to the first position of the three zero match – konvas Aug 15 '14 at 22:01