This loop works for a small amount of data, but on a large volume of data it takes a long time to run. Is there an alternative way to do this in R that would speed up the processing?

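# set corrections on the transactions
mins <- 45
n <- nrow(tnx)
for (i in seq_len(n)) {
  # the last row has no successor, so it always closes its group
  if (i < n && tnx$id[i] == tnx$id[i + 1]) {
    # a gap of 45 minutes or more ends this trip and starts a new one
    if (tnx$diff[i] >= mins) {
      tnx$FIRST[i + 1] <- TRUE
      tnx$LAST[i] <- TRUE
    }
  } else {
    tnx$LAST[i] <- TRUE
  }
}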

Thanks in advance.

EDIT

What I am trying to do is set the TRUE/FALSE values in the FIRST and LAST columns by checking the diff column.

Data like:

tnx <- data.frame(
  id=rep(c("A","C","D","E"),4:1),
  FIRST=c(T,T,F,F,T,F,F,T,F,T),
  LAST=c(T,F,F,T,F,F,T,F,T,T),
  diff=c(270,15,20,-1,5,20,-1,15,-1,-1)
)

EDIT PORTION FOR @thelatemail

#   id diff FIRST  LAST
#1   A  270  TRUE  TRUE
#2   A   15  TRUE FALSE
#3   A   20 FALSE FALSE
#4   A   -1 FALSE  TRUE
#5   C    5  TRUE FALSE
#6   C   20 FALSE FALSE
#7   C   -1 FALSE  TRUE
#8   D   15  TRUE FALSE
#9   D   -1 FALSE  TRUE
#10  E   -1  TRUE  TRUE
  • You should give some sample data and explain what you are trying to do. – CHP Apr 03 '14 at 04:33
  • @ChinmayPatil See edited portion. Thank you. –  Apr 03 '14 at 04:39
  • Please don't post screenshots of data, they are usually useless for actually testing code. Use `dput(head(tnx))` or something similar instead. Are you just trying to find the first and last case in each `id` group? – thelatemail Apr 03 '14 at 04:43
  • @thelatemail yeah, if the value in the diff column is 45 or more, then FIRST in the next row will be TRUE and LAST in the current row will also be TRUE. –  Apr 03 '14 at 06:06
  • A single loop that is slow is usually slow because of the way it is constructed. Poorly constructed loops can slow down dramatically as the data set grows. Please read this post on how to speed up a for loop: http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r If you need nested loops, look at the apply family of functions. – CCurtis Apr 03 '14 at 06:21

2 Answers

This solves the problem about as fast as R can do it. Note that the core is 4 lines and there are no loops of any kind. I first test id against a version of itself shifted by one position, so that a single test finds all of the positions where id[i] == id[i+1] at once. After that I use that logical vector to select, or help select, the values in LAST and FIRST that I want to change.
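
To see the shifted comparison in isolation, here is a minimal sketch on a made-up vector. The names `id` and `sameAsNext` and the padding value "XX" are illustrative assumptions; any padding value that cannot occur as a real id would do.

```r
id <- c("A", "A", "C", "C", "C", "D")
# Compare every element with its successor in one vectorised test.
# The final element has no successor, so it is compared with a dummy
# value that cannot match and therefore yields FALSE.
sameAsNext <- id == c(id[-1], "XX")
print(sameAsNext)
# [1]  TRUE FALSE  TRUE  TRUE FALSE FALSE
```

Every FALSE marks the last row of a run of identical ids, which is the information the loop was recomputing one row at a time.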

# First I reset the LAST and FIRST columns and set some variables up.
# Note that if you're starting from scratch with no FIRST column at all then 
# you don't need to declare it here yet
tnx$FIRST <- FALSE
tnx$LAST <- FALSE
mins <- 45
n <- nrow(tnx)
# and this is all there is to it
idMatch <- tnx$id == c(as.character(tnx$id[2:n]), 'XX')
tnx$LAST[ idMatch & tnx$diff >= mins] <- TRUE
tnx$LAST[ !idMatch] <- TRUE
tnx$FIRST <- c(TRUE, tnx$LAST[1:(n-1)])
John
  • What is the 'XX'? Which column does it refer to? –  Apr 03 '14 at 06:03
  • It doesn't refer to any column, it's just a dummy padding the vector being shifted up so that tnx$LAST for the last item is TRUE. Now that your data is posted I could test this and I see I needed a couple of small tweaks to get it to work. It does what you need now very very fast. – John Apr 03 '14 at 12:18

Something like this should work: I reset the FIRST and LAST values to make it obvious in this example:

tnx$FIRST <- FALSE
tnx$LAST <- FALSE

The next two parts use ?ave to respectively set tnx$FIRST to TRUE for the first row in each id group, and tnx$LAST to TRUE for the last row in each id group.

tnx$FIRST <- as.logical(
              with(tnx, ave(diff,id,FUN=function(x) seq_along(x)==1) ))
tnx$LAST <- as.logical(
              with(tnx, ave(diff,id,FUN=function(x) seq_along(x)==length(x))))

The final two parts then:
- set tnx$LAST to TRUE when tnx$diff is >=45.
- set tnx$FIRST to TRUE when the previous value for tnx$diff is >=45

tnx$LAST[tnx$diff >= 45] <- TRUE
tnx$FIRST[c(NA,head(tnx$diff,-1)) >= 45] <- TRUE


#   id diff FIRST  LAST
#1   A  270  TRUE  TRUE
#2   A   15  TRUE FALSE
#3   A   20 FALSE FALSE
#4   A   -1 FALSE  TRUE
#5   C    5  TRUE FALSE
#6   C   20 FALSE FALSE
#7   C   -1 FALSE  TRUE
#8   D   15  TRUE FALSE
#9   D   -1 FALSE  TRUE
#10  E   -1  TRUE  TRUE
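
The expression `c(NA, head(tnx$diff, -1))` used for FIRST builds a copy of the column shifted down by one row, so each row sees the previous row's diff. A minimal sketch on a made-up vector (the names `d`, `prev` and `flag` are illustrative, not part of the code above):

```r
d <- c(270, 15, 20, -1)
# Shift down by one: position i now holds the value from position i - 1.
prev <- c(NA, head(d, -1))
flag <- rep(FALSE, length(d))
# An NA in a logical index selects nothing when a single value is assigned,
# so the first row, which has no predecessor, is left untouched.
flag[prev >= 45] <- TRUE
print(flag)
# [1] FALSE  TRUE FALSE FALSE
```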
thelatemail
  • The goal is actually to find the first and last record of every transaction, which is determined by the diff column. For example, the diff between the first and second rows is 270, so in the first row both FIRST and LAST are TRUE. The diff between the second and third rows is 15 minutes, which is less than 45, so the second row should be TRUE FALSE. Since the diffs for records 3 and 4 are both less than 45, LAST is TRUE in row 4. –  Apr 03 '14 at 06:19
  • Check edited portion for desired outcomes. Thank you! –  Apr 03 '14 at 06:22
  • It works, but would you mind briefly explaining the whole thing? Thank you! –  Apr 03 '14 at 06:41
  • @Carol - short explanation provided. – thelatemail Apr 03 '14 at 06:47
  • Technically `ave` is split apply and therefore looping. So, this will be substantially slower on a lot of data than a version with no loops. (which is possible) – John Apr 03 '14 at 12:14
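
The speed difference mentioned in the last comment can be checked on your own data rather than taken on trust. Below is a rough sketch, assuming rows with the same id are stored contiguously. The function names `shifted` and `grouped`, the synthetic data frame `big` and its size are all illustrative assumptions; the two functions restate the two answers' approaches and are not the answerers' exact code. Timings depend on the machine and the data, so none are quoted here.

```r
set.seed(1)
# Synthetic data: 2000 ids with 1 to 10 contiguous rows each (assumed layout).
sizes <- sample(1:10, 2000, replace = TRUE)
id <- rep(seq_along(sizes), sizes)
n <- length(id)
d <- sample(1:120, n, replace = TRUE)
d[c(id[-n] != id[-1], TRUE)] <- -1   # last row of each id carries -1
big <- data.frame(id = id, diff = d)

# Shifted-vector approach: no grouping, only whole-column comparisons.
shifted <- function(x, mins = 45) {
  n <- nrow(x)
  sameNext <- c(x$id[-1] == x$id[-n], FALSE)
  last <- !sameNext | x$diff >= mins
  x$FIRST <- c(TRUE, last[-n])
  x$LAST <- last
  x
}

# Grouped approach: ave() applies a function to each id group.
grouped <- function(x, mins = 45) {
  x$FIRST <- as.logical(ave(x$diff, x$id, FUN = function(v) seq_along(v) == 1))
  x$LAST <- as.logical(ave(x$diff, x$id, FUN = function(v) seq_along(v) == length(v)))
  x$LAST[x$diff >= mins] <- TRUE
  x$FIRST[c(NA, head(x$diff, -1)) >= mins] <- TRUE
  x
}

print(system.time(a <- shifted(big)))
print(system.time(b <- grouped(big)))
# Both approaches should flag the same rows on contiguous ids.
stopifnot(identical(a$FIRST, b$FIRST), identical(a$LAST, b$LAST))
```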