12

I have my data that looks like below:

CustomerID TripDate
1           1/3/2013
1           1/4/2013
1           1/9/2013
2           2/1/2013
2           2/4/2013
3           1/2/2013

I need to create a counter variable, which will be like below:

CustomerID TripDate   TripCounter
1           1/3/2013   1
1           1/4/2013   2 
1           1/9/2013   3
2           2/1/2013   1
2           2/4/2013   2 
3           1/2/2013   1 

Tripcounter will be for each customer.

Frank
  • 66,179
  • 8
  • 96
  • 180

6 Answers6

14

Use ave. Assuming your data.frame is called "mydf":

mydf$counter <- with(mydf, ave(CustomerID, CustomerID, FUN = seq_along))
mydf
#   CustomerID TripDate counter
# 1          1 1/3/2013       1
# 2          1 1/4/2013       2
# 3          1 1/9/2013       3
# 4          2 2/1/2013       1
# 5          2 2/4/2013       2
# 6          3 1/2/2013       1

For what it's worth, I also implemented a version of this approach in a function included in my "splitstackshape" package. The function is called getanID:

mydf <- data.frame(IDA = c("a", "a", "a", "b", "b", "b", "b"),
                   IDB = c(1, 2, 1, 1, 2, 2, 2), values = 1:7)
mydf
# install.packages("splitstackshape")
library(splitstackshape)
# getanID(mydf, id.vars = c("IDA", "IDB"))
getanID(mydf, id.vars = 1:2)
#   IDA IDB values .id
# 1   a   1      1   1
# 2   a   2      2   1
# 3   a   1      3   2
# 4   b   1      4   1
# 5   b   2      5   1
# 6   b   2      6   2
# 7   b   2      7   3

As you can see from the example above, I've written the function in such a way that you can specify one or more columns that should be treated as ID columns. It checks to see if any of the id.vars are duplicated, and if they are, then it generates a new ID variable for you.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
9

You can also use plyr for this (using @AnadaMahto's example data):

> ddply(mydf, .(IDA), transform, .id = seq_along(IDA))
  IDA IDB values .id
1   a   1      1   1
2   a   2      2   2
3   a   1      3   3
4   b   1      4   1
5   b   2      5   2
6   b   2      6   3
7   b   2      7   4

or even:

> ddply(mydf, .(IDA, IDB), transform, .id = seq_along(IDA))
  IDA IDB values .id
1   a   1      1   1
2   a   1      3   2
3   a   2      2   1
4   b   1      4   1
5   b   2      5   1
6   b   2      6   2
7   b   2      7   3

Note that plyr does not have a reputation for being the quickest solution, for that you need to take a look at data.table.


Here's a data.table approach:

library(data.table)
DT <- data.table(mydf)
DT[, .id := sequence(.N), by = "IDA,IDB"]
DT
#    IDA IDB values .id
# 1:   a   1      1   1
# 2:   a   2      2   1
# 3:   a   1      3   2
# 4:   b   1      4   1
# 5:   b   2      5   1
# 6:   b   2      6   2
# 7:   b   2      7   3
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
Paul Hiemstra
  • 59,984
  • 12
  • 142
  • 149
  • I've never worked with `data.table`, only heard of its splendour and glory :). – Paul Hiemstra Aug 30 '13 at 08:48
  • @PaulHiemstra But you have worked with it. Was it only once? http://www.r-statistics.com/2011/08/comparison-of-ave-ddply-and-data-table/ – Matt Dowle Aug 30 '13 at 09:51
  • It was a small test which I copied from someone else, see the SO post referred to in that post. I've been meaning to learn data.table, but have never felt the real need. Maybe my data is not big enough :). – Paul Hiemstra Aug 30 '13 at 10:40
  • 1
    Or use `seq_len(.N)` should be faster – akrun Oct 21 '15 at 13:08
7

meanwhile, you can also use dplyr. if your data.frame is called mydata

library(dplyr)
mydata %>% group_by(CustomerID) %>% mutate(TripCounter = row_number())
rmuc8
  • 2,869
  • 7
  • 27
  • 36
3

I need to do this often, and wrote a function that accomplishes it differently than the previous answers. I am not sure which solution is most efficient.

idCounter <- function(x)  {
    unlist(lapply(rle(x)$lengths, seq_len))
}

mydf$TripCounter <- idCounter(mydf$CustomerID)
RayB
  • 31
  • 3
  • This is incorrect. You meant to use `sequence` which is *very different* from `seq_along`. – A5C1D2H2I1M1N2O1R2T1 Aug 30 '13 at 08:46
  • Try this instead: `idCounter <- function(x) sequence(rle(x)$lengths)` – A5C1D2H2I1M1N2O1R2T1 Aug 30 '13 at 08:48
  • @Ananda, Thank you for catching this, I accidentally typed "seq_along" instead of "seq_len" (now fixed above). I was unfamiliar with `sequence`, and looking at `?sequence` I see that my function definition is similar to that of `sequence` (i.e., `sequence <- function(nvec) unlist(lapply(nvec, seq_len))`), except I replaced `nvec` with `rle(x)$lengths`. Using `sequence` directly is definitely more elegant; however, being a relatively new R user, I find my longer version helpful because I can better follow the steps involved. – RayB Aug 30 '13 at 10:23
  • another point to note is that this approach would require sorting the data first if it isn't already sorted. – A5C1D2H2I1M1N2O1R2T1 Aug 30 '13 at 17:03
0

Here's the procedure styled code. I dont believe in things like if you are using loop in R then you are probably doing something wrong

x <- dataframe$CustomerID
dataframe$counter <- 0
y <- dataframe$counter
count <- 1
for (i in 1:length(x)) {
  ifelse (x[i] == x[i-1], count <- count + 1, count <- 1 )
  y[i] <- count
}
dataframe$counter <- y
juba
  • 47,631
  • 14
  • 113
  • 118
Alok Nayak
  • 2,381
  • 22
  • 28
0

This isn't the right answer but showing some interesting things comparing to for loops, vectorization is fast does not care about sequential updating. a<-read.table(textConnection( "CustomerID TripDate 1 1/3/2013 1 1/4/2013 1 1/9/2013 2 2/1/2013 2 2/4/2013 3 1/2/2013 "), header=TRUE)

a <- a %>%
group_by(CustomerID,TripDate) # must in order

res <- rep(1, nrow(a))  #base @ 1
res[2:6] <-sapply(2:6, function(i)if(a$CustomerID[i]== a$CustomerID[i - 1]) {res[i] = res[i-1]+1} else {res[i]= res[i]})
a$TripeCounter <- res
zhan2383
  • 669
  • 5
  • 9