1

I am trying to avoid for loop and use apply instead for post-processing flags that I have detected.

I have a time series with a column showing whether the quality was ok or not. Here's how the dataframe looks like:

n <- 100
tstart <- strptime("12/15/16 16:00:00", "%m/%d/%y %H:%M:%S")
df <- data.frame(Date = tstart + seq(0,n*5-1,5) + sample(seq(0,3,1), n, replace = T),
             Check = sample(c("FLAG", "PASS"), n, replace = T))

# head of df
#         Date           Check
# 1 2016-12-15 16:00:02  FLAG
# 2 2016-12-15 16:00:05  PASS
# 3 2016-12-15 16:00:13  FLAG
# 4 2016-12-15 16:00:17  PASS
# 5 2016-12-15 16:00:22  FLAG
# 6 2016-12-15 16:00:26  FLAG

I don't like to pick up all the FLAGs though. I want to apply three conditions:

1) Disregard flags where the time difference from previous row is more than 60 seconds

2) I'd like to keep flags that have been repeating for a while.

Here's how I am implementing this:

df$Time_Difference <- c(0,as.numeric(diff(df$Date)))
df$Flag_Counter <- 0
desired_rep <- 3
# Start the clock!
ptm <- proc.time()
for (row_index in 2:nrow(df)){
    if (df[row_index, "Time_Difference"] > 60){
        df[row_index, "Flag_Counter"] <- 0
    }
    else {
        if (df[row_index, "Check"] == "PASS"){
            df[row_index, "Flag_Counter"] <- max(0, df[row_index-1, "Flag_Counter"] - 1)
        }
        else {
            df[row_index, "Flag_Counter"] <- min(desired_rep, df[row_index-1, "Flag_Counter"] + 1)
        }
    }
}
# Stop the clock
x <- proc.time() - ptm
print(x[3])

So, really the for loop is getting the flags that have been repeating for desired_rep times in a row. In case we have a PASS after two FLAGs, 1 is Flag_Counter and finally we do df[, df$Flag_Counter == 3] we can the post-processed flags. Now, this is extremely slow. I was wondering if we can use apply to make this task faster. I have done this in Python but I don't know how to access previous rows in my pre-defined function and then use apply. I appreciate your help.

ahoosh
  • 1,340
  • 3
  • 17
  • 31
  • This is a difficult reproducible example as there are no time differences greater than 60 seconds between rows. Also, what is your desired result? Just a new column, *FlagCounter*? – Parfait Dec 15 '16 at 19:02
  • For examples involving random processes, [please add `set.seed` for reproducibility](http://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function). – Pierre L Dec 15 '16 at 19:31

2 Answers2

2

Try this:

desired_rep = 3

# If Time_Difference > 60, 0, otherwise 1 if "Flag", -1 if "Pass"
df$temp = ifelse(df$Check=='FLAG',1,-1)*(df$Time_Difference<=60)

# Do a "cumsum" that's bounded between 0 and 3, and resets to 0 if Time_Difference > 60
df$Flag_Counter = Reduce(function(x,y) max(0, min(desired_rep,x+y))*(y!=0), df$temp, acc=T)

In general, Reduce() is useful when you need to update a "state" sequentially, with the limitation that the input is a single list/vector (here, the temp column).

sirallen
  • 1,947
  • 14
  • 21
  • Great. My original for loop takes 180 seconds to complete on a 200k row data frame. Your method does it in less than a second! Is the key for superb performance `Reduce` function? What does `acc=T` do here? Thanks. – ahoosh Dec 15 '16 at 21:14
  • 1
    It's costly to access elements of a data.frame by looping through the rows (not sure what the reason is, to be honest, but compare `for (i in 1:1000000) {}` and `d = data.frame(x=1:1000000); for (i in 1:1000000) {d[i,]}`). If you take a look at the source code for `Reduce`, you'll see that it also uses for-loops, but over lists, which is more efficient. (`e = as.list(1:1000000); for(i in 1:1000000) {e[[i]]}`) – sirallen Dec 15 '16 at 22:00
  • 1
    `accumulate=T` causes `Reduce` to return all intermediate results, so you get a vector of the same length as `x`. – sirallen Dec 15 '16 at 22:01
1

Give this a try:

n <- 100
tstart <- strptime("12/15/16 16:00:00", "%m/%d/%y %H:%M:%S")
df <- data.frame(Date = tstart + seq(0,n*5-1,5) + sample(seq(0,3,1), n, replace = T),
                 Check = sample(c("FLAG", "PASS"), n, replace = T))

desired_rep <- 3 #set the desired repetition limit

The time you used in the example code was End_Time. I am assuming this should be the Date from the original data set?

df$Time_Difference <- c(0,as.numeric(diff(df$Date)))

Find the consecutive flags. Thanks to this post.

df$consecutive_flag_count <- sequence(rle(as.character(df$Check))$lengths)

Create a check_again column which will return OK if the Check is Pass or the Time_Difference is less than 60 and there are fewer than desired_rep consecutive Check.

df$check_again <- ifelse(df$Check == "PASS", "OK", 
 ifelse(df$Time_Difference < 60 & df$consecutive_flag_count >= desired_rep, "CHECK_AGAIN","OK"))

You can then easily filter to the CHECK_AGAIN items as follows.

df_check_again <- df[df$check_again == "CHECK_AGAIN", ]
> df_check_again
                  Date Check Time_Difference consecutive_flag_count check_again
3  2016-12-15 16:00:11  FLAG               4                      3 CHECK_AGAIN
4  2016-12-15 16:00:18  FLAG               7                      4 CHECK_AGAIN
17 2016-12-15 16:01:23  FLAG               5                      3 CHECK_AGAIN
18 2016-12-15 16:01:26  FLAG               3                      4 CHECK_AGAIN
19 2016-12-15 16:01:30  FLAG               4                      5 CHECK_AGAIN
20 2016-12-15 16:01:37  FLAG               7                      6 CHECK_AGAIN
27 2016-12-15 16:02:10  FLAG               3                      3 CHECK_AGAIN
28 2016-12-15 16:02:18  FLAG               8                      4 CHECK_AGAIN
29 2016-12-15 16:02:20  FLAG               2                      5 CHECK_AGAIN
42 2016-12-15 16:03:27  FLAG               4                      3 CHECK_AGAIN
43 2016-12-15 16:03:33  FLAG               6                      4 CHECK_AGAIN
44 2016-12-15 16:03:38  FLAG               5                      5 CHECK_AGAIN
55 2016-12-15 16:04:33  FLAG               7                      3 CHECK_AGAIN
56 2016-12-15 16:04:36  FLAG               3                      4 CHECK_AGAIN
57 2016-12-15 16:04:41  FLAG               5                      5 CHECK_AGAIN
58 2016-12-15 16:04:45  FLAG               4                      6 CHECK_AGAIN
85 2016-12-15 16:07:02  FLAG               7                      3 CHECK_AGAIN
> 
Community
  • 1
  • 1
Nick Criswell
  • 1,733
  • 2
  • 16
  • 32
  • Thanks for the reply. This is a great solution. I think the following line should be changed this way `df$check_again <- ifelse(df$Check == "PASS", "OK", ifelse(df$Time_Difference < 60 & df$consecutive_flag_count >= desired_rep, "CHECK_AGAIN","OK"))` so that we get the right answer. – ahoosh Dec 15 '16 at 21:32
  • Excellent call, @bikhaab. Updating answer now. – Nick Criswell Dec 15 '16 at 21:34
  • Your solution is very fast. But, I am not getting the same result I got from the for loop. I am investigating why! – ahoosh Dec 15 '16 at 21:54
  • I think `consecutive_flag_count` should not be assessed independent of `Time_Difference`. For example, at the end of a day we might have been getting flags, there is 8 hours without data, and in the morning continuing getting flags. Well, `consecutive_flag_count` puts them al together, however, there is one really big `Time_Difference` in there that breaks the consequence. Apply `df$Time_Difference < 60` later on does not help and should be incorporated at the same time. – ahoosh Dec 15 '16 at 22:07
  • Would adding, `df$consecutive_flag_count_mod <-ifelse(df$Time_Difference > 60, df$consecutive_flag_count <- 1,df$consecutive_flag_count)` to the code and then getting `check_again` from the the `consecutive_flag_count_mod` do the trick then? I apologize for the errors. Happy to delete the response since there is another working solution posted. – Nick Criswell Dec 15 '16 at 22:17
  • I think your answer is fantastic and fast. If there time difference was not an issue, this solution would be the fastest. Please don't erase it as I am sure people will benefit from reading it in the future. – ahoosh Dec 15 '16 at 22:19