0

I am trying to optimize my R code by removing nested for loop with vectorization. My nested for loop include rbind based on if condition. Nested for loop code works however, when running vectorized code using rbind, doesn't fill the new dataframe.

For the background, I have two dataframes-'ip' , 'ip_error'. Data frame ‘ip’ with Dimension is ‘469 5’. Data frame ‘ip_error’ is with Dimension is ‘9 11’. After comparison of two data frames on the specific columns of task start and end with session start and end, my output is the selected rows from data frame ‘ip’.

This is my working code with nested for loop

for(j in 1:length(ip$RUID_KEY)){
 for(i in 1:length(ip_error$RUID_KEY)){
  if(isTRUE(ip_error$RUID_KEY[i]==ip$RUID_KEY[j]&&ip_error$TASK_START[i]>=ip$sess_start[j]&&ip_error$TASK_END[i]<ip$sess_end[j])){
    ev_ip_error<-rbind(ev_ip_error,ip[j,])
  }
}
}

My code with vectorization is as follows, which does not work

al<-1:length(ip$RUID_KEY)
bl<-1:length(ip_error$RUID_KEY)

f<- function(i,j){
  if(isTRUE(ip_error$RUID_KEY[i]==ip$RUID_KEY[j]&&ip_error$TASK_START[i]>=ip$sess_start[j]&&ip_error$TASK_END[i]<ip$sess_end[j])){
    ev_ip_error<-rbind(ev_ip_error,ip[j,])
  }
}

mapply(f,al,bl)

Here is example of my data frames, where for rows 1 and 3 in 'ip_error' satisfy the if condition

No.     RUID_KEY    sess_start  sess_end
1   101 2018-12-01 22:48:18.827 2018-12-01 22:55:18.900
2   201 2018-12-01 13:10:20.100 2018-12-01 13:50:10.000
3   201 2018-12-12 11:10:10.100 2018-12-12 11:20:00.100

‘ip_error’ data frame

No. RUID_KEY    TASK_START  TASK_END    TASK_NAME
1   101 2018-12-01 22:50:18.827 2018-12-01 22:50:18.827 ERROR1
2   101 2018-12-01 15:10:20.100 2018-12-01 15:10:20.100 ERROR2
3   201 2018-12-01 13:40:10.100 2018-12-01 13:40:10.100 ERROR1
ev_ip_error<-data.frame(matrix(ncol=5,nrow=0))
x<-c("RUID_KEY", "sess_start", "sess_end")
colnames(ev_ip_error)<-x
  • 1
    This would be a better question with some short example input data and what you want as an outcome. In particular, are all the values of `ip_error$RUID_KEY` unique? – Henry Feb 27 '20 at 15:37
  • ip_error$RUID_KEY is not unique. I have edited my question with example. – Booma Devi Feb 27 '20 at 16:33
  • Code has to error out as *ev_ip_error* is never defined. Also, `mapply` like other apply family members are hidden loops. You may be applying [code vectorization](https://stackoverflow.com/a/28986505/1422451) but not [calculated vectorization](https://stackoverflow.com/a/29006276/1422451). – Parfait Feb 27 '20 at 16:39
  • ev_ip_error is an empty data frame, which I have defined with column names as same as 'ip' data frame – Booma Devi Feb 27 '20 at 16:42
  • Have you tried extending `f<- function(i,j)` by `f<- function(i,j,ev_ip_error)`? Or use `ev_ip_error<<-` instead of `ev_ip_error<-`? – geoidiot Feb 27 '20 at 16:43
  • I have added the code for initialization of my 'ev_ip_error' dataframe. Actually, my nested for loop code works and rows from 'ip' data frame does get copied into my empty data frame 'ev_ip_error'. I just want to optimize this code to run on large data set. – Booma Devi Feb 27 '20 at 16:54
  • @geoidiot both your suggestion doesn't work – Booma Devi Feb 27 '20 at 16:56
  • There’s no need to put `isTRUE` around a single conditional. The function is only needed to coerce non-`logical` values, or values that might be a vector of length >1, but this can never be the case here. – Konrad Rudolph Feb 27 '20 at 17:46

2 Answers2

0

Consider merge of the two data frames and then subset by time:

ev_ip_error <- subset(merge(ip, ip_error, by="RUID_KEY", suffixes=c("", "_")),
                      TASK_START >= sess_start & TASK_END < sess_end)[names(ip)]

ev_ip_error

#   No. RUID_KEY          sess_start            sess_end
# 1   1      101 2018-12-01 22:48:18 2018-12-01 22:55:18
# 3   2      201 2018-12-01 13:10:20 2018-12-01 13:50:10

Which is equivalent to unadjusted for loop and corrected mapply (or Map) approach that builds a list of data frames with expand.grid (for all possible combinations between RUID_KEY values). Since apply family solutions do not save scoped variables you need to build object outside its loop or call rbind once outside loop. This would be more efficient than for loop. See below:

prms <- expand.grid(al = 1:length(ip$RUID_KEY),
                    bl = 1:length(ip_error$RUID_KEY))

f <- function(i,j){
  if(isTRUE(ip_error$RUID_KEY[i]==ip$RUID_KEY[j] && ip_error$TASK_START[i]>=ip$sess_start[j] && ip_error$TASK_END[i]<ip$sess_end[j])){
     return(ip[j,])
  }
}

df_list <- mapply(f, prms$al, prms$bl, SIMPLIFY = FALSE)
#df_list <- Map(f, prms$al, prms$bl)   # EQUIVALENT

ev_ip_error <- do.call(rbind, df_list)

See comparison of all three approaches in Online Demo.

Parfait
  • 104,375
  • 17
  • 94
  • 125
  • Thanks. Yes, the code with merge did work for me. But the one with mapply was filling in Null value in df_list. – Booma Devi Feb 28 '20 at 13:42
  • Hmmm...that would mean nothing met the `if` condition and so `return` was `NULL`. However, if you see the online demo using your exact posted data, that is not the case. – Parfait Feb 28 '20 at 15:20
0

I would suggest using the data.table package and using inner join with inequality conditions. It's fast and straightforward to use once you get used to the syntax.

Here is the setup:

Step 1: create the example dataset:

ip <- data.table::data.table(
  ruid_key = c(101, 201, 201),
  sess_start = as.POSIXct(c(
    '2018-12-01 22:48:18.827',
    '2018-12-01 13:10:20.100',
    '2018-12-12 11:10:10.100'
    )),
  sess_end = as.POSIXct(c(
    '2018-12-01 22:55:18.900',
    '2018-12-01 13:50:10.000',
    '2018-12-12 11:20:00.100')))


ip_error <- data.table::data.table(
  ruid_key = c(101,101,201),
  task_start = as.POSIXct(c(
    '2018-12-01 22:50:18.827',
    '2018-12-01 15:10:20.100',
    '2018-12-01 13:40:10.100'
  )),
  task_end = as.POSIXct(c(
    '2018-12-01 22:50:18.827',
    '2018-12-01 15:10:20.100',
    '2018-12-01 13:40:10.100'
  ))
)

Step 2. do inner join, add inequalities directly to the on condition in the join

ip[ip_error, 
   on = c('ruid_key', 'sess_start<=task_start', 'sess_end>task_end'),
   .(sess_start = x.sess_start, sess_end = x.sess_end),
   nomatch = NULL
   ]
ira
  • 2,542
  • 2
  • 22
  • 36
  • Your code did work on a small sample on R studio, but the same gave me error when run on jupyter notebook on a larger data set. The error is: Error in `[.data.frame`(ip, ip_error, on = c("RUID_KEY", "sess_start<=TASK_START", : unused arguments (on = c("RUID_KEY", "sess_start<=TASK_START", "sess_end>TASK_END"), nomatch = NULL) Traceback: 1. ip[ip_error, on = c("RUID_KEY", "sess_start<=TASK_START", "sess_end>TASK_END"), . .(RUID_KEY = x.RUID_KEY, EVENT_SPEC_1 = x.EVENT_SPEC_1, sess_start = x.sess_start, . sess_end = x.sess_end, sess_dur = x.sess_dur), nomatch = NULL] – Booma Devi Feb 28 '20 at 11:11
  • You need to turn the data.frame into data.table – ira Feb 28 '20 at 11:45
  • Yes, you are right, it works now. Also it is fast on large data set. – Booma Devi Feb 28 '20 at 13:40