
I have a binary time series (or simply a vector) of 0s and 1s, generated with ifelse() from some condition. I would like to return the counts (here the series is daily, so the counts are days) between the 1s.

I can do this very easily in Excel by referencing the column I am calculating and adding the row above (for ascending data; for descending data I would reference the row below). See the table below.

I tried doing something similar in R but I am getting an error.

DaysBetweenCondition1 = as.numeric(ifelse((Condition1 ==0 ),0,lag(DaysBetweenCondition1)+1))

Is there an easier way to do this besides writing a function?

Row#  Date       Condition1  DaysBetweenCondition1
1     5/2/2007   NA          NA
2     5/3/2007   NA          NA
3     5/4/2007   NA          NA
4     5/5/2007   NA          NA
5     5/6/2007   0           NA
6     5/7/2007   0           NA
7     5/8/2007   0           NA
8     5/9/2007   0           NA
9     5/10/2007  0           NA
10    5/11/2007  0           NA
11    5/12/2007  0           NA
12    5/13/2007  0           NA
13    5/14/2007  1           0
14    5/15/2007  0           1
15    5/16/2007  0           2
16    5/17/2007  0           3
17    5/18/2007  0           4
18    5/19/2007  0           5
19    5/20/2007  0           6
20    5/21/2007  0           7
21    5/22/2007  1           0
22    5/23/2007  0           1
23    5/24/2007  0           2
24    5/25/2007  0           3
25    5/26/2007  0           4
26    5/27/2007  1           0
27    5/28/2007  0           1
28    5/29/2007  0           2
29    5/30/2007  1           0

(FWIW, the dates in this example are made up; in the real data I am using business days, so it is a bit different. I don't want to reference the dates directly, they are only included here for clarity.)
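For reference, the example above could be reconstructed in R with something like this (the data frame name is just a placeholder):

# hypothetical reconstruction of the example data shown above
df <- data.frame(
  Date       = seq(as.Date("2007-05-02"), by = "day", length.out = 29),
  Condition1 = c(rep(NA, 4), rep(0, 8), 1, rep(0, 7), 1, rep(0, 4), 1, 0, 0, 1)
)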

AbeeCrombie

3 Answers


This gets the counting done in one line, borrowing PhiSeu's example setup and a line from "How to reset cumsum at end of consecutive string", modified to count zeros:

# Example
df_date <- cbind.data.frame(c(1:20),
                            c(rep("18/08/2016", times = 20)),
                            c(rep(NA, times = 5), 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0),
                            stringsAsFactors = FALSE)
colnames(df_date) <- c("Row#", "Date", "Condition1")

# add the new column with 0 as default value
DaysBetweenCondition1 <- c(rep(0, nrow(df_date)))
# bind column to dataframe
df_date <- cbind(df_date, DaysBetweenCondition1)

# sequence() restarts at 1 for every run found by rle(); multiplying by
# !Condition1 resets the count to 0 on rows where Condition1 == 1
df_date$DaysBetweenCondition1 <- sequence(rle(!df_date$Condition1)$lengths) * !df_date$Condition1
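To see what the one-liner is doing, here is a minimal sketch on a small vector without NAs (values chosen just for illustration):

x <- c(0, 0, 1, 0, 0, 0, 1, 0)
rle(!x)$lengths                 # run lengths of the zero/one runs: 2 1 3 1 1
sequence(rle(!x)$lengths)       # 1 2 1 1 2 3 1 1
sequence(rle(!x)$lengths) * !x  # 1 2 0 1 2 3 0 1 -- zeros count up, each 1 resets to 0

Note that zeros before the first 1 are counted from the start of the vector rather than returned as NA, which differs slightly from the expected output in the question.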
Brian O'Donnell

R is very good at working with rows that don't depend on each other, which is why so many functions are vectorized. Calculations in which a row depends on the value of previous rows are not as straightforward.

At the moment I can only provide you with a solution using a loop. I assume there is a better solution without a loop.

# Example
df_date <- cbind.data.frame(c(1:20),
                            c(rep("18/08/2016", times = 20)),
                            c(rep(NA, times = 5), 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0),
                            stringsAsFactors = FALSE)
colnames(df_date) <- c("Row#", "Date", "Condition1")

# add the new column with 0 as default value
DaysBetweenCondition1 <- c(rep(0, nrow(df_date)))
# bind column to dataframe
df_date <- cbind(df_date, DaysBetweenCondition1)

# loop over rows
for (i in 1:nrow(df_date)) {

  if (is.na(df_date$Condition1[i])) {
    # condition not defined yet: no count
    df_date$DaysBetweenCondition1[i] <- NA
  } else if (df_date$Condition1[i] == 0 & is.na(df_date$Condition1[i - 1])) {
    # a 0 directly after the NA block: still no 1 to count from
    df_date$DaysBetweenCondition1[i] <- NA
  } else if (df_date$Condition1[i] == 0) {
    # a 0 with a usable previous row: increment the running count
    df_date$DaysBetweenCondition1[i] <- df_date$DaysBetweenCondition1[i - 1] + 1
  } else {
    # a 1 resets the count to 0
    df_date$DaysBetweenCondition1[i] <- 0
  }

}
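For the example data above, the loop should produce a DaysBetweenCondition1 column along these lines (zeros before the first 1 stay NA because there is no preceding 1 to count from):

df_date$DaysBetweenCondition1
# [1] NA NA NA NA NA NA  0  1  2  0  1  2  3  4  0  0  1  0  1  2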
PhiSeu

Here's a solution that should be relatively fast:

f0 = function(x) {
    y = x                        # template for return value
    isna = is.na(x)              # used a couple of times
    grp = cumsum(x[!isna])       # use '1' to mark start of each group
    lag = lapply(tabulate(grp + 1), function(len) {
        seq(0, length.out=len)   # sequence from 0 to len-1
    })
    split(y[!isna], grp) <- lag  # split y, set to lag element, unsplit
    data.frame(x, y)
}

A faster version avoids the lapply() loop: it creates a running index along x (seq_along(x)) and an offset vector describing how that index should be corrected, based on where the most recent 1 occurs in the original vector.

f1 = function(x0) {
    y0 = x0                        # template for return value (keeps the NAs)
    x = x0[!is.na(x0)]
    y = seq_along(x)               # running index over the non-NA values
    offset = rep(c(1, y[x==1]), tabulate(cumsum(x) + 1))
    y0[!is.na(y0)] = y - offset    # subtract the start index of each group
    data.frame(x0, y0)
}
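A quick sanity check (assuming the data.frame(x0, y0) return shown above) that the two versions agree on the example data used below:

set.seed(123)
x <- c(rep(NA, 5), rbinom(30, 1, .15))
all.equal(f0(x)$y, f1(x)$y0)   # should be TRUE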

Walking through the first solution, here's some data

> set.seed(123)
> x = c(rep(NA, 5), rbinom(30, 1, .15))
> x
 [1] NA NA NA NA NA  0  0  0  1  1  0  0  1  0  0  1  0  0  0  0  1  0  0  0  1
[26]  1  0  0  1  0  0  0  0  0  0

use cumsum() to figure out the group the non-NA data belong to

> isna = is.na(x)
> grp = cumsum(x[!isna])
> grp
 [1] 0 0 0 1 2 2 2 3 3 3 4 4 4 4 4 5 5 5 5 6 7 7 7 8 8 8 8 8 8 8

use tabulate() to figure out the number of elements in each group, lapply() to generate the relevant sequences

> lag = lapply(tabulate(grp + 1), function(len) seq(0, length.out=len))
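The first few elements of lag should look something like this (lengths taken from the group sizes above):

> lag[1:3]
[[1]]
[1] 0 1 2

[[2]]
[1] 0

[[3]]
[1] 0 1 2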

finally, create a vector to hold the result, and use split<- to update it with the lag

> y = x
> split(y[!isna], grp) <- lag
> data.frame(x, y)
    x  y
1  NA NA
2  NA NA
3  NA NA
4  NA NA
5  NA NA
6   0  0
7   0  1
8   0  2
9   1  0
10  1  0
11  0  1
12  0  2
13  1  0
14  0  1
15  0  2
16  1  0
17  0  1
...

The key to the second solution is the calculation of the offset. The goal is to be able to 'correct' y = seq_along(x) by the value of y at the most recent 1 in x, kind of like 'fill down' in Excel. The starting values are c(1, y[x==1]) and each needs to be replicated by the number of elements in the group tabulate(cumsum(x) + 1).
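As a small illustration of the offset calculation (a hypothetical six-element vector, ignoring NAs for simplicity):

x <- c(0, 0, 1, 0, 0, 1)
y <- seq_along(x)            # 1 2 3 4 5 6
tabulate(cumsum(x) + 1)      # group sizes: 2 3 1
offset <- rep(c(1, y[x == 1]), tabulate(cumsum(x) + 1))
offset                       # 1 1 3 3 3 6
y - offset                   # 0 1 0 1 2 0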

Martin Morgan