
Can you make this R code faster? I can't see how to vectorize it. I have a data frame as follows (sample rows below):

> str(tt)
'data.frame':   1008142 obs. of  4 variables:
 $ customer_id: int  ...
 $ visit_date : Date, format: "2010-04-04", ...

I want to compute the diff between visit_dates for a customer. So I do diff(tt$visit_date), but I have to enforce a discontinuity (NA) wherever customer_id changes, since the diff across two customers is meaningless, e.g. row 74 below. The code at the bottom does this, but takes >15 min on the 1M-row dataset. I also tried computing the sub-result piecewise per customer_id (using which()) and cbind'ing them; that was also slow. Any suggestions? Thanks. I did search SO, R-intro, the R man pages, etc. (A toy version of the data you can paste is sketched after the sample rows.)

   customer_id visit_date visit_spend ivi
72          40 2011-03-15       18.38   5
73          40 2011-03-20       23.45   5
74          79 2010-04-07      150.87  NA
75          79 2010-04-17      101.90  10
76          79 2010-05-02      111.90  15
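A minimal reproducible sketch of the data, built from the sample rows above (illustrative only; the real tt has ~1M rows):

# Toy tt built from the sample rows above, for experimenting
tt <- data.frame(
  customer_id = c(40, 40, 79, 79, 79),
  visit_date  = as.Date(c("2011-03-15", "2011-03-20",
                          "2010-04-07", "2010-04-17", "2010-05-02")),
  visit_spend = c(18.38, 23.45, 150.87, 101.90, 111.90)
)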

Code:

all_tt_cids <- unique(tt$customer_id)

# Append ivi (Intervisit interval) column
tt$ivi <- c(NA,diff(tt$visit_date))
for (cid in all_tt_cids) {
  # ivi has a discontinuity when customer_id changes
  tt$ivi[min(which(tt$customer_id==cid))] <- NA
}

(Wondering if we can create a logical index where customer_id differs from the row above?)


1 Answer


To set NA in the appropriate places, you can again use diff() and a one-line trick:

> tt$ivi[c(1,diff(tt$customer_id)) != 0] <- NA

Explanation

Let's take some vector x:

x <- c(1,1,1,1,2,2,2,4,4,4,5,3,3,3)

We want to find the indexes at which a new number starts, i.e. (1, 5, 8, 11, 12). We can use diff() for that:

y <- c(1,diff(x))
# y = 1  0  0  0  1  0  0  2  0  0  1 -2  0  0

and assign NA at the indexes where y is not equal to zero:

x[y!=0] <- NA
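Putting it together on the toy data from the question (a sketch; diff() on a Date vector returns a difftime, so as.numeric() keeps ivi a plain numeric column):

tt$ivi <- c(NA, as.numeric(diff(tt$visit_date)))  # raw gaps in days
tt$ivi[c(1, diff(tt$customer_id)) != 0] <- NA     # NA wherever customer_id changes
tt
#   customer_id visit_date visit_spend ivi
# 1          40 2011-03-15       18.38  NA
# 2          40 2011-03-20       23.45   5
# 3          79 2010-04-07      150.87  NA
# 4          79 2010-04-17      101.90  10
# 5          79 2010-05-02      111.90  15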
  • Awesome! You meant `tt$ivi[c(1,diff(tt$customer_id)) != 0] <- NA` – smci Oct 31 '11 at 07:22
  • Sure I get the idea, I just don't get why this should be >1000x faster than the sequential access. – smci Oct 31 '11 at 07:29
  • In R, whenever you can get rid of loops, do it. R is heavily vectorized; you should write R code in terms of whole vectors, not individual numbers iterated one by one. – Max Oct 31 '11 at 07:32
  • Yes, I knew that, and was trying to vectorize this. But 1000x slower is ridiculous? – smci Oct 31 '11 at 07:50
  • 1
    @smci, there are several discussions in SO about why R loops are slower than vectorization. for ex. take a look at http://stackoverflow.com/questions/7142767/why-are-loops-slow-in-r – Max Oct 31 '11 at 08:17
  • 1
    @smci The bottleneck in your code is the assignment using `<-` to a `data.frame`, which is very slow. You should be able to get a substantial performance improvement by using `lapply` rather than a loop. (Note that I am not saying `lapply` is faster than loops in general. It's not `lapply` that is faster, but avoiding `'<-.data.frame'` during each iteration.) – Andrie Oct 31 '11 at 08:17
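A minimal sketch of Andrie's suggestion, assuming tt is sorted by customer_id (split() returns groups in sorted level order, so unlist() restores the original row order only under that assumption); note the single assignment into the data frame:

ivi_list <- lapply(split(as.numeric(tt$visit_date), tt$customer_id),
                   function(d) c(NA, diff(d)))  # per-customer gaps, NA at each start
tt$ivi <- unname(unlist(ivi_list))              # one '<-.data.frame' call, not one per customer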