0

I am trying to find out the connecting transaction. From the first TRUE to last TRUE, its considered as one transaction and also find out in the transaction, the tpt_mode whether is mixed or pure. Then, insert a new column with new data but currently now the for loop is working with little volume of data, when it comes to huge volume of data, it tends to run very slow. How can I optimize the for loop to speed up the performance?

firstid<-1
currTpt <- 'NA'
count<-0
n <- nrow(tnx)
for (i in 1:n) {

  if(tnx$FIRST[i]){

    firstid<-i
    currTpt <-tnx$mode[i]
   count <-1
  }
   else{
   count <- count + 1
  }
  if(as.character(tnx$mode[i])!= as.character(currTpt)){
    currTpt <- 'both'
  }
  if(tnx$LAST[i])
  {
    tnx$final_end_loc[firstid]<-tnx$end_loc[i]    
    tnx$final_end_date[firstid]<-as.character(tnx$end_date[i])  
    tnx$final_end_time[firstid]<-as.character(tnx$end_time[i])
    tnx$final_mode[firstid]<-as.character(currTpt)
    tnx$final_count[firstid] <- count
  }
  }
final_tnx<-subset(tnx,FIRST==TRUE,c("id","start_date","start_time","final_end_date","final_end_time","start_loc","final_end_loc","final_mode","final_count"))

Sample data: EDIT

   tnx<- data.frame(
  id=c("A","A","A","A","C","C","D","D","E"),
  mode=c("on","on","off","on","on","off","off","off","on"),
  start_time=c("8:20:22","17:20:22","17:45:22","18:20:22","16:35:22","17:20:22","15:20:22","16:00:22","12:20:22"),
  end_time=c("8:45:22","17:30:22","18:00:22","18:30:22","17:00:22","17:50:22","15:45:22","16:14:22","27:50:22"),
  start_loc=c("12","12","207","12","11","65","222","32","12"),
  end_loc=c(31,31,29,11,22,12,45,31,11),
  start_date=c("6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012"),
  end_date=c("6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012","6/3/2012"),
  FIRST=c(T,T,F,F,T,F,T,F,T),
  LAST=c(T,F,F,T,F,T,F,T,T)
)

Sample dataset in picture form:

enter image description here

Expected results:

enter image description here

Thanks in advance.

  • What is f? Its missing. – CCurtis Apr 04 '14 at 04:46
  • I am pretty sure this can be done in one line, but I can't figure out what you are trying to do. Can you explain and pretend we don't know what a connecting transaction is...? – beroe Apr 04 '14 at 04:48
  • Sorry, it was a mistake. I amended already. –  Apr 04 '14 at 04:48
  • Why did you delete the same question and post it again? Original question: http://stackoverflow.com/questions/22852046/optimize-r-for-loop – Matthew Lundberg Apr 04 '14 at 04:51
  • This one is the updated one as I already tried with the code and it's a working version one. I closed it because I saw there's votes for closing the thread. –  Apr 04 '14 at 04:53
  • 1
    Close votes or down votes means that you should improve the question, which you did. Close votes will expire. If you've improved the question, and closing is no longer appropriate, more close votes will probably not accumulate. Too late this time, as you've already opened an exact duplicate question and one will be surely closed if you undelete the other, but for next time, improve and don't close. You were doing just fine prior to the delete and re-post. – Matthew Lundberg Apr 04 '14 at 04:56
  • Sure thing, thanks for the advice. I hope this thread will be clearer prior to the previous one and it helps the readers who are solving my problem :) –  Apr 04 '14 at 04:59

2 Answers2

1

To get your results, you don't need a loop. If you check where your transactions start and end and index accordingly your code simplifies to

nLAST <- which(tnx$LAST)
nFIRST <- which(tnx$FIRST)
count <- sapply(1:length(nFIRST),FUN = function(i){nFIRST[i]:nLAST[i]})
mode <- unlist(lapply(count,FUN=function(x){ifelse(length(unique(tnx$mode[x]))==1,
                      as.character(unique(tnx$mode[x])),'both')}))
final_tnx <- data.frame(id = tnx$id[nFIRST],start_date = tnx$start_date[nFIRST],
    start_time = tnx$start_time[nFIRST],final_end_date = tnx$end_date[nLAST],
    final_end_time = tnx$end_time[nLAST], start_loc=tnx$start_loc[nFIRST], 
    final_end_loc = tnx$end_loc[nLAST],final_mode =  mode, 
    final_count = nLAST - nFIRST +1)

This should definitly speed up things and also perform well on larger data sets.

EDIT: When the mode is allowed to change more than once you have to check for the uniqueness on all subsets. In count I build a list of index sequences for each record. Then apply on the index list a function that checks whether there is one or more modes in the subset.

wici
  • 1,681
  • 1
  • 14
  • 21
  • Indexing outside of the loop or no loop is much faster. :) – CCurtis Apr 04 '14 at 16:12
  • @wici Now you compare the mode depends on the first and last record. What if in between the mode is off, let's say ID A and then the mode is on , on, off, on. Then, the mode is still on, it's not both. How ca I solve this problem? Thank you. –  Apr 07 '14 at 01:23
  • @wici I've updated the sample data. Do you have any clues to solve it? –  Apr 07 '14 at 03:57
0

I'm sure there are far more improvements to be made but if you index as little as possible in the loop and specify data as vectors you can see some improvement.

require("rbenchmark")

###Specify data as vectors
FIRST <- tnx$FIRST
mode <- tnx$mode
LAST <- tnx$LAST
final_end_loc <- tnx$final_end_loc
final_end_date <- tnx$final_end_date
final_end_time <- tnx$final_end_time
final_mode <- tnx$final_mode
final_count <- tnx$final_count
end_date <- tnx$end_date
end_time <- tnx$end_time
end_loc <- tnx$end_loc

benchmark(for (i in 1:n) {

   if(FIRST[i]){

    firstid<-i
    currTpt <-mode[i]
    count <-1
}
else{
    count <- count + 1
}
if(as.character(mode[i])!= as.character(currTpt)){
    currTpt <- 'both'
}
if(LAST[i])
{
    final_end_loc[firstid]<-end_loc[i]    
    final_end_date[firstid]<-as.character(end_date[i])  
    final_end_time[firstid]<-as.character(end_time[i])
    final_mode[firstid]<-as.character(currTpt)
    final_count[firstid] <- count
}
})

 replications elapsed relative user.self sys.self user.child sys.child
1          100    0.11        1      0.11        0         NA        NA

Now your loop

   replications elapsed relative user.self sys.self user.child sys.child
1          100    0.18        1      0.19        0         NA        NA

Cannot be certain if this will perform well with large dataset but keeping indexing to a minimum have worked for me in the past. A good post can be found here Speed up the loop operation in R if this isn't fast enough for you or doesn't work well with large data.

Community
  • 1
  • 1
CCurtis
  • 1,770
  • 3
  • 15
  • 25
  • What does the require() do? –  Apr 04 '14 at 06:09
  • 1
    require() is same as library(), which loads the namespace of the package and attaches it, except that it gives a warning rather than an error when package is not found. [require()](http://stat.ethz.ch/R-manual/R-devel/library/base/html/library.html) – gkcn Apr 04 '14 at 07:48
  • By the way there is a nice chapter on performance in Hadley Wickham's [Advanced R Programming](http://adv-r.had.co.nz/Performance.html) – gkcn Apr 04 '14 at 07:51