
I've got a data set, called visitsPerDay, that looks like this but with 405,890 rows and 10,406 unique CUST_ID values:

> CUST_ID   Date
> 1         2013-09-19
> 1         2013-10-03
> 1         2013-10-08
> 1         2013-10-12
> 1         2013-10-20
> 1         2013-10-25
> 1         2013-11-01
> 1         2013-11-02
> 1         2013-11-08
> 1         2013-11-15
> 1         2013-11-23
> 1         2013-12-02
> 1         2013-12-04
> 1         2013-12-09
> 2         2013-09-16
> 2         2013-09-17
> 2         2013-09-18

What I'd like to do is create a new variable that is the lagged difference between consecutive visit dates for each customer. Here is the code I'm currently using:

visitsPerDay <- visitsPerDay[order(visitsPerDay$CUST_ID), ]
cust_id <- 0
for (i in 1:nrow(visitsPerDay)) {
  if (visitsPerDay$CUST_ID[i] != cust_id) {
    cust_id <- visitsPerDay$CUST_ID[i]
    visitsPerDay$MBTV[i] <- NA
  } else {
    visitsPerDay$MBTV[i] <- as.numeric(visitsPerDay$Date[i] - visitsPerDay$Date[i-1])
  }
}

I feel like this is certainly not the most efficient way to do this. Does anyone have a better way to approach it? Thanks!

tblznbits
  • Unless `order()` is a stable sort and the original data is already sorted by date, the algorithm is wrong. – amit Jan 17 '14 at 15:11
  • I see your point, amit. To ensure that the dates are chronological, I should sort by date as well (see the sketch after these comments). Nevertheless, that's not the crux of my problem at the moment. I'm currently running the algorithm listed above and I've already passed a 5-minute run time. – tblznbits Jan 17 '14 at 15:14
  • I would suggest you do some reading up on loops in R, which are generally avoided. See http://yihui.name/en/2010/10/on-the-gory-loops-in-r/ and http://stackoverflow.com/questions/7142767/why-are-loops-slow-in-r – rrs Jan 17 '14 at 16:02
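
For reference, a minimal sketch of the two-key sort discussed in the comments above (assuming `Date` is already a `Date`, or at least a sortable "YYYY-MM-DD" string):

visitsPerDay <- visitsPerDay[order(visitsPerDay$CUST_ID, visitsPerDay$Date), ]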

3 Answers


Here's the data.table solution. This will likely be much faster and is more readable:

library(data.table)
dt = data.table(visitsPerDay)

dt[, MBTV := c(NA, diff(as.Date(Date))), by = CUST_ID]
dt
#    CUST_ID       Date    MBTV
# 1:       1 2013-09-19 NA days
# 2:       1 2013-10-03 14 days
# 3:       1 2013-10-08  5 days
# 4:       1 2013-10-12  4 days
# 5:       1 2013-10-20  8 days
# 6:       1 2013-10-25  5 days
# 7:       1 2013-11-01  7 days
# 8:       1 2013-11-02  1 days
# 9:       1 2013-11-08  6 days
#10:       1 2013-11-15  7 days
#11:       1 2013-11-23  8 days
#12:       1 2013-12-02  9 days
#13:       1 2013-12-04  2 days
#14:       1 2013-12-09  5 days
#15:       2 2013-09-16 NA days
#16:       2 2013-09-17  1 days
#17:       2 2013-09-18  1 days
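
If plain numeric day counts are preferred over the difftime printing above, a small variant of the same call (my sketch, not from the original answer) wraps the difference in as.numeric:

dt[, MBTV := c(NA_real_, as.numeric(diff(as.Date(Date)))), by = CUST_ID]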
eddi

Here's an approach with tapply:

# transform 'Date' to values of class 'Date' (maybe already done)
visitsPerDay$Date <- as.Date(visitsPerDay$Date) 

visitsPerDay <- transform(visitsPerDay, 
                          MBTV = unlist(tapply(Date, 
                                               CUST_ID, 
                                               FUN = function(x) c(NA,diff(x)))))

The result:

    CUST_ID       Date MBTV
11        1 2013-09-19   NA
12        1 2013-10-03   14
13        1 2013-10-08    5
14        1 2013-10-12    4
15        1 2013-10-20    8
16        1 2013-10-25    5
17        1 2013-11-01    7
18        1 2013-11-02    1
19        1 2013-11-08    6
110       1 2013-11-15    7
111       1 2013-11-23    8
112       1 2013-12-02    9
113       1 2013-12-04    2
114       1 2013-12-09    5
21        2 2013-09-16   NA
22        2 2013-09-17    1
23        2 2013-09-18    1

Edit: A faster approach:

# transform 'Date' to values of class 'Date' (maybe already done)
visitsPerDay$Date <- as.Date(visitsPerDay$Date) 

visitsPerDay$MBTV <- c(NA_integer_, 
                       "is.na<-"(diff(visitsPerDay$Date), 
                                 !duplicated(visitsPerDay$CUST_ID)[-1]))
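
To unpack that one-liner (a toy illustration of mine, not part of the original answer): diff() produces one gap per consecutive pair of rows, !duplicated(CUST_ID)[-1] flags the pairs that straddle a customer boundary, and "is.na<-" blanks those gaps out before the leading NA is prepended:

# toy data: two customers, rows already sorted by customer and then date
id    <- c(1, 1, 1, 2, 2)
dates <- as.Date(c("2013-09-19", "2013-10-03", "2013-10-08",
                   "2013-09-16", "2013-09-17"))

d        <- diff(dates)                 # gaps in days: 14  5 -22  1
boundary <- !duplicated(id)[-1]         # FALSE FALSE  TRUE FALSE
c(NA_integer_, "is.na<-"(d, boundary))  # NA 14  5 NA  1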
Sven Hohenstein
  • Absolutely perfect. 2.58 seconds to complete. Nice work! – tblznbits Jan 17 '14 at 15:23
  • 2
    @brittenb 2.6 seconds is rather slow. You should be able to do this much faster if you implement it with package data.table. – Roland Jan 17 '14 at 15:44
  • @brittenb I added a (hopefully) faster approach. – Sven Hohenstein Jan 17 '14 at 15:56
  • 1
    @Roland 2.6 seconds may not be the fastest approach, which the title to this post implies I am looking for. So I totally understand the comment. However, I just needed something faster than what I was using. There is a definite tradeoff between speed/efficiency and readability/adaptability since other people need to read the code and understand it. The initial solution that Sven provided was clean, elegant, and took minimal time to execute. Any further tweaking to shed one second from the computing time seems counter productive given the limited number of times this code will be run. – tblznbits Jan 17 '14 at 16:45

You can accelerate the process by doing a bucket sort rather than an ordinary sort, since you are sorting by cust_id. Note that the bottleneck in the algorithm (in big-O terms) is the sorting, which is O(n log n).

The following pseudo code assumes the data is given sorted by date (same assumption you need for the suggested code in the answer):

//bucket sort:
customers <- new array of size 10,406
for each (cust_id, date):
   if customers[cust_id] == nil:
        customers[cust_id] <- []
   customers[cust_id].append(date)
//find differences:
for each list in customers:
   i <- list.iter()
   prev <- i.next()
   while (i.hasNext()):
        curr <- i.next()
        output diff(prev, curr)
        prev <- curr

The above code runs in O(n), which is theoretically better than your approach (for large enough inputs), at the cost of more memory consumption.
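
A rough R counterpart of this idea (my own sketch, not code from the answer) uses split() to bucket each customer's dates in a single pass and then diffs within each bucket:

# bucket each customer's visit dates (roughly the "customers" array above);
# assumes rows are already in date order, as the pseudo code does
buckets <- split(visitsPerDay$Date, visitsPerDay$CUST_ID)

# gaps between consecutive visits within each bucket
gaps <- lapply(buckets, diff)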

amit