0

I am attempting to append a sequence number to a data frame grouped by individuals and date. For example, to turn this:

   x          y
1  A 2012-01-02
2  A 2012-02-03
3  A 2012-02-25
4  A 2012-03-04
5  B 2012-01-02
6  B 2012-02-03
7  C 2013-01-02
8  C 2012-02-03
9  C 2012-03-04
10 C 2012-04-05

in to this:

   x          y v
1  A 2012-01-02 1
2  A 2012-02-03 2
3  A 2012-02-25 3
4  A 2012-03-04 4
5  B 2012-01-02 1
6  B 2012-02-03 2
7  C 2013-01-02 1
8  C 2012-02-03 2
9  C 2012-03-04 3
10 C 2012-04-05 4

where "x" is the individual, "y" is the date, and "v" is the appended sequence number

I have had success on a small data frame using a for loop in this code:

x=c("A","A","A","A","B","B","C","C","C","C")
y=as.Date(c("1/2/2012","2/3/2012","2/25/2012","3/4/2012","1/2/2012","2/3/2012",
"1/2/2013","2/3/2012","3/4/2012","4/5/2012"),"%m/%d/%Y")
x
y
z=data.frame(x,y)

z$v=rep(1,nrow(z))

for(i in 2:nrow(z)){
    if(z$x[i]==z$x[i-1]){
    z$v[i]=(z$v[i-1]+1)
    } else {
    z$v[i]=1
    }
}

but when I expand this to a much larger data frame (250K+ rows) the process takes forever.

Any thoughts on how I can make this more efficient?

Will Phillips
  • 805
  • 2
  • 10
  • 20
  • You may have a look at `data.table` and `dplyr` alternatives [**here**](http://stackoverflow.com/questions/22288462/replace-loop-with-an-pply-alternative) (in addtion to good old `ave`) – Henrik Apr 08 '14 at 17:26
  • Thank you for the feedback. This got me searching in a new direction and I was able to put something together that worked much faster then my for loop but it didn't work as well as Richard Scriven's response below. Thanks again. – Will Phillips Apr 08 '14 at 20:34
  • With all due respect to the answer of Richard, but I tend to believe that a `data.table` solution will be faster than `unlist(sapply(sapply(split` – Henrik Apr 08 '14 at 20:40
  • Henrik, you are absolutely correct. data.table has an extremely efficient solution to this problem. It took some digging and testing but I came across a data.table solution that can append the sequence number for 250K+ records in less than a second. – Will Phillips Apr 18 '14 at 13:48

2 Answers2

1

This seems to work. May be overkill though.

## code needed revision - this is old code
## > d$v <- unlist(sapply(sapply(split(d, d$x), nrow), seq))

EDIT

I can't believe I got away with that ugly mess for so long. Here's a revision. Much simpler.

## revised 04/24/2014
> d$v <- unlist(sapply(table(d$x), seq))
> d
##    x          y v
## 1  A 2012-01-02 1
## 2  A 2012-02-03 2
## 3  A 2012-02-25 3
## 4  A 2012-03-04 4
## 5  B 2012-01-02 1
## 6  B 2012-02-03 2
## 7  C 2013-01-02 1
## 8  C 2012-02-03 2
## 9  C 2012-03-04 3
## 10 C 2012-04-05 4

Also, an interesting one is stack. Take a look.

> stack(sapply(table(d$x), seq))
##    values ind
## 1       1   A
## 2       2   A
## 3       3   A
## 4       4   A
## 5       1   B
## 6       2   B
## 7       1   C
## 8       2   C
## 9       3   C
## 10      4   C
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
  • Hello Richard, Thank you very much. I came up with another solution using the plyr package. I did some testing and your solution is faster. – Will Phillips Apr 08 '14 at 21:01
0

I'm removing my previous post and replacing it with this solution. Extremely efficient for my purposes.

# order data
z=z[order(z$x,z$y),]

#convert to data table
dt.z=data.table(z)

# obtain vector of sequence numbers
z$seq=dt.z[,1:.N,"x"]$V1

The above can be accomplished in fewer steps but I wanted to illustrate what I did. This is appending sequence numbers to my data sets of over 250k records in under a second. Thanks again to Henrik and Richard.

Will Phillips
  • 805
  • 2
  • 10
  • 20