-1

I have a data table DT with 15,000+ rows. I have a for loop that runs correctly, but it takes 30+ seconds and is the slowest part of the code overall. Here is the for loop:

for (i in 2:nrow(DT)) {
 if(DT$C1[i] == DT$C1[i+1] & DT$C2[i] != DT$C2[i+1] & DT$C3[i+1] - DT$C3[i] <= 4 & DT$C2[i] == "Short" & DT$C2[i+1] != "Long") DT$C4[i] = 1 else 
  if(DT$C1[i] == DT$C1[i-1] & DT$C2[i] != DT$C2[i-1] & DT$C3[i] - DT$C3[i-1] <= 4 & DT$C2[i] == "Short" & DT$C2[i-1] != "Long") DT$C4[i] = 1 else
      0 }

Is there any way to speed this up? I have seen some examples here and elsewhere, but they don't quite solve my particular issue with i+1 and i-1.

Here is some example data.

C1 <- c("1","1","1","1","1","2","2","2","3","3","3","3","3","3","3","3","4","4","4","4","4","4","4","4","4","4","4","4","4")
C2 <- c("Short","Short","Short","None","None","Short","Short","Other","Long","Long","Long","Long","Long","Long","Long","Long","Short","Short","Other","Short","Short","None","Short","None","Other","Short","Short","Short","Short")
C3 <- c(as.Date("2010-06-01"),as.Date("2010-06-05"),as.Date("2010-06-09"),as.Date("2010-06-13"),as.Date("2010-06-17"),as.Date("2010-06-02"),as.Date("2010-06-21"),as.Date("2010-07-09"),as.Date("2010-07-13"),as.Date("2010-07-17"),as.Date("2010-07-21"),as.Date("2010-08-01"),as.Date("2010-08-05"),as.Date("2010-08-09"),as.Date("2010-09-03"),as.Date("2010-09-07"),as.Date("2010-06-03"),as.Date("2010-06-07"),as.Date("2010-06-11"),as.Date("2010-06-14"),as.Date("2010-06-17"),as.Date("2010-06-21"),as.Date("2010-06-24"),as.Date("2010-06-27"),as.Date("2010-07-01"),as.Date("2010-07-05"),as.Date("2010-07-09"),as.Date("2010-07-13"),as.Date("2010-07-17"))

DF <- data.frame(C1=C1, C2=C2, C3=C3)
DT <- as.data.table(DF)

And the desired output.

C1  C2      C3          C4
1   Short   2010-06-01  0
1   Short   2010-06-05  0
1   Short   2010-06-09  1
1   None    2010-06-13  0
1   None    2010-06-17  0
2   Short   2010-06-02  0
2   Short   2010-06-21  0
2   Other   2010-07-09  0
3   Long    2010-07-13  0
3   Long    2010-07-17  0
3   Long    2010-07-21  0
3   Long    2010-08-01  0
3   Long    2010-08-05  0
3   Long    2010-08-09  0
3   Long    2010-09-03  0
3   Long    2010-09-07  0
4   Short   2010-06-03  0
4   Short   2010-06-07  1
4   Other   2010-06-11  0
4   Short   2010-06-14  1
4   Short   2010-06-17  1
4   None    2010-06-21  0
4   Short   2010-06-24  1
4   None    2010-06-27  0
4   Other   2010-07-01  0
4   Short   2010-07-05  1
4   Short   2010-07-09  0
4   Short   2010-07-13  0
4   Short   2010-07-17  0

Thank you for your help.

Community
  • 1
  • 1
  • 1
    Please provide and example data set that this loop will give desired output. – David Arenburg May 13 '15 at 15:52
  • I think you should probably just create a new column or index that represents the lag and lead of your columns in question, and then do away with the loop in favor of a vectorized logic operation. [See this question](http://stackoverflow.com/questions/3558988/basic-lag-in-r-vector-dataframe) for examples on how to do a lag/lead with a `data.frame` object. – Forrest R. Stevens May 13 '15 at 15:58
  • Did you add that second if statement to handle the beginning and end of your vectors? You may be able to replace one of them. – Pierre L May 13 '15 at 15:58
  • Added example data and desired output. Thanks for taking a look everyone. – Andrew Russell May 13 '15 at 17:40

1 Answers1

1

You can vectorize this with something like:

n <- nrow(DT)
DT$C4 <- NA  # Initialize however you want
# Warning -- untested due to no reproducible example...
DT$C4[2:(n-1)] <- as.numeric((DT$C1[2:(n-1)] == DT$C1[3:n] & DT$C2[2:(n-1)] != DT$C2[3:n] & DT$C3[3:n] - DT$C3[2:(n-1)] <= 4 & DT$C2[2:(n-1)] == "Short" & DT$C2[3:n] != "Long") |
                             (DT$C1[2:(n-1)] == DT$C1[1:(n-2)] & DT$C2[2:(n-1)] != DT$C2[1:(n-2)] & DT$C3[2:(n-1)] - DT$C3[1:(n-2)] <= 4 & DT$C2[2:(n-1)] == "Short" & DT$C2[1:(n-2)] != "Long"))

Basically each time you indexed by i in the loop you replace that with 2:(n-1), each time you indexed with i-1 you replace that with 1:(n-2), and each time you indexed with i+1 you replace that with 3:n.

josliber
  • 43,891
  • 12
  • 98
  • 133