Calculate differences between rows faster than a for loop?

Question

I have a data set that looks like this:

ID   |   DATE    | SCORE
-------------------------
123  |  1/15/10  |  10
123  |  1/1/10   |  15
124  |  3/5/10   |  20
124  |  1/5/10   |  30
...

So to load the above snippet as a data frame, the code is:

id<-c(123,123,124,124)
date<-as.Date(c('2010-01-15','2010-01-01','2010-03-05','2010-01-05'))
score<-c(10,15,20,30)
data<-data.frame(id,date,score)

I'm trying to add a column that calculates the "days since last record for this ID".

Right now I'm using a FOR loop that looks something like this:

data$dayssincelast <- rep(NA, nrow(data))
for(i in 2:nrow(data)) {
  if(data$id[i] == data$id[i-1]) 
    data$dayssincelast[i] <- data$date[i] - data$date[i-1]
}

Is there a faster way to do this? (I've looked a bit into APPLY but can't quite figure out a solution besides a FOR loop.)

Thanks in advance!

Please add to your question the output of `dput(head(data))`. Your dates don't look like something you can subtract — GSee, Nov 27 '12 at 19:54
There are many ways to approach the split-apply piece, but all of them will probably end up using `diff`. — joran, Nov 27 '12 at 19:56
@GSee - I did not show it, but I converted the dates already using as.Date(). The above is just dummy data to illustrate the structure. — Dave Guarino, Nov 27 '12 at 22:21
@Dave, you'll get better Answers if you make your Questions [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — GSee, Nov 27 '12 at 23:26
Thank you, @GSee - I've edited the question to make it reproducible. (I'm new to R on SO, so appreciate the pointer! :D ) — Dave Guarino, Nov 28 '12 at 02:29

nograpes · Accepted Answer · 2012-11-28T19:16:18.490

5

This should work if the dates are in order within each id.

id<-c(123,123,124,124)
date<-as.Date(c('2010-01-15','2010-01-01','2010-03-05','2010-01-05'))
score<-c(10,15,20,30)
data<-data.frame(id,date,score)

data <- data[order(data$id,data$date),]
data$dayssincelast<-do.call(c,by(data$date,data$id,function(x) c(NA,diff(x))))
# Or, even more concisely
data$dayssincelast<-unlist(by(data$date,data$id,function(x) c(NA,diff(x))))
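
As a possible alternative (not part of the original answer), base R's `ave` applies a function within groups and puts the results back in the original row positions, so the new column stays aligned with the rows. A minimal sketch, assuming the data are still sorted by date within each id so the differences come out positive:

```r
# Sample data from the question
id <- c(123, 123, 124, 124)
date <- as.Date(c('2010-01-15', '2010-01-01', '2010-03-05', '2010-01-05'))
score <- c(10, 15, 20, 30)
data <- data.frame(id, date, score)

# Sort so each id's dates are in ascending order
data <- data[order(data$id, data$date), ]

# Work on numeric day counts; ave() returns a vector aligned with the rows,
# and the first record of each id gets NA
data$dayssincelast <- ave(as.numeric(data$date), data$id,
                          FUN = function(x) c(NA, diff(x)))
```

The resulting column is plain numeric days rather than a difftime.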

edited Nov 28 '12 at 19:16

answered Nov 27 '12 at 20:01

(No change. Sorry about that.) – Matthew Lundberg Nov 28 '12 at 04:35

score 0 · Answer 2 · answered Nov 27 '12 at 21:14

How does the following work for you?

 indx <- which(data$id == c(data$id[-1], NA))
 data$date[indx] - data$date[indx+1]

This just shifts the id's by 1 and compares them to id to check for neighboring matches.
Then, for the date subtraction, simply subtract the date of the subsequent row from the date of each matching row.
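
The same shift-and-compare idea can also fill the new column directly, with NA for the first record of each id. This is only a sketch under a few assumptions: the data are sorted by id and then date first, and the names `same_id` and `gap` are made up for illustration.

```r
# Sample data from the question
id <- c(123, 123, 124, 124)
date <- as.Date(c('2010-01-15', '2010-01-01', '2010-03-05', '2010-01-05'))
data <- data.frame(id, date)

# Assumption: sort by id, then date, so each row follows its predecessor
data <- data[order(data$id, data$date), ]

n <- nrow(data)
# TRUE where a row has the same id as the row before it
same_id <- c(FALSE, data$id[-1] == data$id[-n])
# Days between each row and the previous row, regardless of id
gap <- c(NA, as.numeric(diff(data$date)))
# Keep the gap only where the previous row belongs to the same id
data$dayssincelast <- ifelse(same_id, gap, NA_real_)
```

Everything here is a whole-vector operation, so there is no explicit loop over rows.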

score 0 · Answer 3 · answered Nov 28 '12 at 03:01

In the case where you need a more complex formula, you can use aggregate:

a <- aggregate(date ~ id, data=data, FUN=function(x) c(NA,diff(x)))
data$dayssincelast <- c(t(a[-1]), recursive=TRUE) # Remove 'id' column

The same sort-order requirement applies here as in @nograpes's answer.
