
I have written a loop in R and I would like to make it run a lot faster. The task is to calculate delta values for a time column in a data frame (tibble). The wrinkle is that each delta should be taken from the previous row whose level column value (range 1-9) is greater than or equal to the current row's. I need to run this over approximately one billion rows, and current performance is substantially below one million rows per second, so I am looking for at least an order of magnitude speed-up.

Here is the code:

ref <- rep(NA_real_, 9)  # one reference timestamp per level, initially NA
timedelta <- function(level, time) {
  delta <- time - ref[level]   # time since the last row whose level was >= this one
  ref[1:level] <<- time        # this row becomes the reference for levels 1..level
  delta
}
mapply(timedelta, tl$level, tl$time)  # returns the vector of deltas
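
For testing, here is a sketch of synthetic data in the shape described (the values are invented; the real table has about a billion rows):

# Invented sample data matching the shape above (a tibble would behave the same).
set.seed(42)
n <- 1e6
tl <- data.frame(
  time  = cumsum(runif(n)),               # strictly increasing timestamps
  level = sample(1:9, n, replace = TRUE)  # level per row, 1-9
)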

How do I make that run fast?

(I have asked the same question in the context of dplyr over at How to add flexible delta columns using dplyr? but I did not manage to get the performance I need with dplyr and so I am asking again here.)

  • I don't completely understand what you need to do, but in a situation where an iteration depends on the result of the previous iteration, I'd try Rcpp (see the sketch after these comments). – konvas Mar 03 '17 at 11:27
  • Have you tried profiling the code to see where the bottleneck is? – Roman Luštrik Mar 03 '17 at 12:09
  • For large datasets you could try data.table (which is faster than dplyr), in combination with foreach (which allows you to run the loop in parallel). You will get better results here if you post a reproducible example: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Henk Mar 03 '17 at 13:48
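
Following the Rcpp suggestion above, a minimal sketch (the function name timedelta_cpp is invented, and this assumes the Rcpp package is available); the recurrence stays sequential, but the per-row loop runs in C++, where it is cheap:

library(Rcpp)
cppFunction('
NumericVector timedelta_cpp(IntegerVector level, NumericVector time) {
  int n = level.size();
  NumericVector delta(n);
  NumericVector ref(9, NA_REAL);              // reference timestamp per level
  for (int i = 0; i < n; ++i) {
    delta[i] = time[i] - ref[level[i] - 1];   // delta vs. the stored reference
    for (int j = 0; j < level[i]; ++j)
      ref[j] = time[i];                       // this row covers levels 1..level
  }
  return delta;
}')

tl$delta <- timedelta_cpp(tl$level, tl$time)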

1 Answer


I'm not sure I totally understand what you're doing with the code given, but the best thing is to remove the explicit per-row loop. Something like this (note: this treats ref as fixed, so it only matches your loop when ref does not change between rows; a sketch that keeps the row-by-row update follows):

tl$delta <- tl$time - ref[tl$level]        # one vectorized subtraction for all rows
ref[1:max(tl$level)] <- tl$time[nrow(tl)]  # update the reference once, afterwards
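
For the row-by-row dependence in the question specifically, here is a hedged sketch of one way to stay vectorized: since ref[j] always holds the time of the most recent row whose level is >= j, each of the nine level values can be handled in a single vectorized pass (this assumes tl$level is an integer in 1-9 and rows are in time order):

n <- nrow(tl)
delta <- rep(NA_real_, n)
for (v in 1:9) {
  # row index of the most recent row so far whose level is >= v (0 if none yet)
  last_ge <- cummax(ifelse(tl$level >= v, seq_len(n), 0L))
  # shift by one so each row sees only rows strictly before it
  prev <- c(0L, last_ge[-n])
  sel <- tl$level == v & prev > 0L
  delta[sel] <- tl$time[sel] - tl$time[prev[sel]]
}
tl$delta <- delta

Nine vectorized passes over the columns should be far cheaper than a billion R-level function calls.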

Then sum up your deltas, or whatever operation you need. R does not handle explicit loops well; it is fastest when it operates on whole columns and vectors at once. I'll give you another example. Say I want to count how many of the customers in my data frame stayed in my hotel on each day of the year, given their arrival and departure dates. I could write a loop like this:

days <- seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = "days")
num_guests <- rep(0, length(days))
for (d in seq_along(num_guests)) {
  for (i in seq_len(nrow(guests.df))) {
    # count guest i on day d if their stay covers that day
    if (guests.df$Arrive_Date[i] <= days[d] && guests.df$Leave_Date[i] >= days[d]) {
      num_guests[d] <- num_guests[d] + 1
    }
  }
}

This nested-loop strategy takes 13 minutes to run on an i7 processor with 6700 guests in my data frame. Or I can vectorize the inner comparison:

for (d in seq_along(num_guests)) {
  # one vectorized comparison across all guests for day d
  in_period <- guests.df$Arrive_Date <= days[d] & guests.df$Leave_Date >= days[d]
  num_guests[d] <- sum(in_period)
}

The second loop took one second to run.
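
Taking the same idea one step further, the outer loop can go too. A sketch, using the same days and guests.df as above; this is mostly a readability win, since the heavy lifting is already vectorized inside sum():

# One vectorized comparison per day; no explicit for loop left in sight.
num_guests <- vapply(seq_along(days), function(d) {
  sum(guests.df$Arrive_Date <= days[d] & guests.df$Leave_Date >= days[d])
}, integer(1))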