Create counter with multiple variables

Question

I have my data that looks like below:

CustomerID TripDate
1           1/3/2013
1           1/4/2013
1           1/9/2013
2           2/1/2013
2           2/4/2013
3           1/2/2013

I need to create a counter variable, which will be like below:

CustomerID TripDate   TripCounter
1           1/3/2013   1
1           1/4/2013   2 
1           1/9/2013   3
2           2/1/2013   1
2           2/4/2013   2 
3           1/2/2013   1

Tripcounter will be for each customer.

score 14 · Accepted Answer · edited Feb 07 '16 at 17:59

Use ave. Assuming your data.frame is called "mydf":

mydf$counter <- with(mydf, ave(CustomerID, CustomerID, FUN = seq_along))
mydf
#   CustomerID TripDate counter
# 1          1 1/3/2013       1
# 2          1 1/4/2013       2
# 3          1 1/9/2013       3
# 4          2 2/1/2013       1
# 5          2 2/4/2013       2
# 6          3 1/2/2013       1

For what it's worth, I also implemented a version of this approach in a function included in my "splitstackshape" package. The function is called getanID:

mydf <- data.frame(IDA = c("a", "a", "a", "b", "b", "b", "b"),
                   IDB = c(1, 2, 1, 1, 2, 2, 2), values = 1:7)
mydf
# install.packages("splitstackshape")
library(splitstackshape)
# getanID(mydf, id.vars = c("IDA", "IDB"))
getanID(mydf, id.vars = 1:2)
#   IDA IDB values .id
# 1   a   1      1   1
# 2   a   2      2   1
# 3   a   1      3   2
# 4   b   1      4   1
# 5   b   2      5   1
# 6   b   2      6   2
# 7   b   2      7   3

As you can see from the example above, I've written the function in such a way that you can specify one or more columns that should be treated as ID columns. It checks to see if any of the id.vars are duplicated, and if they are, then it generates a new ID variable for you.

score 9 · Answer 2 · edited Aug 30 '13 at 08:53

9

You can also use plyr for this (using @AnadaMahto's example data):

> ddply(mydf, .(IDA), transform, .id = seq_along(IDA))
  IDA IDB values .id
1   a   1      1   1
2   a   2      2   2
3   a   1      3   3
4   b   1      4   1
5   b   2      5   2
6   b   2      6   3
7   b   2      7   4

or even:

> ddply(mydf, .(IDA, IDB), transform, .id = seq_along(IDA))
  IDA IDB values .id
1   a   1      1   1
2   a   1      3   2
3   a   2      2   1
4   b   1      4   1
5   b   2      5   1
6   b   2      6   2
7   b   2      7   3

Note that plyr does not have a reputation for being the quickest solution, for that you need to take a look at data.table.

Here's a data.table approach:

library(data.table)
DT <- data.table(mydf)
DT[, .id := sequence(.N), by = "IDA,IDB"]
DT
#    IDA IDB values .id
# 1:   a   1      1   1
# 2:   a   2      2   1
# 3:   a   1      3   2
# 4:   b   1      4   1
# 5:   b   2      5   1
# 6:   b   2      6   2
# 7:   b   2      7   3

edited Aug 30 '13 at 08:53

A5C1D2H2I1M1N2O1R2T1

190,393
28
405
485

answered Aug 30 '13 at 08:45

Paul Hiemstra

59,984
12
142
149

I've never worked with `data.table`, only heard of its splendour and glory :). – Paul Hiemstra Aug 30 '13 at 08:48
@PaulHiemstra But you have worked with it. Was it only once? http://www.r-statistics.com/2011/08/comparison-of-ave-ddply-and-data-table/ – Matt Dowle Aug 30 '13 at 09:51
It was a small test which I copied from someone else, see the SO post referred to in that post. I've been meaning to learn data.table, but have never felt the real need. Maybe my data is not big enough :). – Paul Hiemstra Aug 30 '13 at 10:40
1

Or use `seq_len(.N)` should be faster – akrun Oct 21 '15 at 13:08

score 7 · Answer 3 · answered Jun 19 '15 at 10:00

7

meanwhile, you can also use dplyr. if your data.frame is called mydata

library(dplyr)
mydata %>% group_by(CustomerID) %>% mutate(TripCounter = row_number())

answered Jun 19 '15 at 10:00

rmuc8

2,869
7
27
36

RayB · Answer 4 · 2013-08-30T09:52:38.343

3

I need to do this often, and wrote a function that accomplishes it differently than the previous answers. I am not sure which solution is most efficient.

idCounter <- function(x)  {
    unlist(lapply(rle(x)$lengths, seq_len))
}

mydf$TripCounter <- idCounter(mydf$CustomerID)

edited Aug 30 '13 at 09:52

answered Aug 30 '13 at 08:24

RayB

31
3

This is incorrect. You meant to use `sequence` which is *very different* from `seq_along`. – A5C1D2H2I1M1N2O1R2T1 Aug 30 '13 at 08:46
Try this instead: `idCounter <- function(x) sequence(rle(x)$lengths)` – A5C1D2H2I1M1N2O1R2T1 Aug 30 '13 at 08:48
@Ananda, Thank you for catching this, I accidentally typed "seq_along" instead of "seq_len" (now fixed above). I was unfamiliar with `sequence`, and looking at `?sequence` I see that my function definition is similar to that of `sequence` (i.e., `sequence <- function(nvec) unlist(lapply(nvec, seq_len))`), except I replaced `nvec` with `rle(x)$lengths`. Using `sequence` directly is definitely more elegant; however, being a relatively new R user, I find my longer version helpful because I can better follow the steps involved. – RayB Aug 30 '13 at 10:23
another point to note is that this approach would require sorting the data first if it isn't already sorted. – A5C1D2H2I1M1N2O1R2T1 Aug 30 '13 at 17:03

score 0 · Answer 5 · edited Aug 30 '13 at 08:40

0

Here's the procedure styled code. I dont believe in things like if you are using loop in R then you are probably doing something wrong

x <- dataframe$CustomerID
dataframe$counter <- 0
y <- dataframe$counter
count <- 1
for (i in 1:length(x)) {
  ifelse (x[i] == x[i-1], count <- count + 1, count <- 1 )
  y[i] <- count
}
dataframe$counter <- y

edited Aug 30 '13 at 08:40

juba

47,631
14
113
118

answered Aug 28 '13 at 12:21

Alok Nayak

2,381
22
28

1

Maybe you are not doing something wrong, but surely you will be doing something slow. – juba Aug 30 '13 at 08:27
1

And I'd prefer the one-liner of @AnandaMahto over your 9 lines of code. – Paul Hiemstra Aug 30 '13 at 08:37
1

turns out, I wud prefer one liner too and I would surely benchmark to see my slowness – Alok Nayak Aug 30 '13 at 10:00

score 0 · Answer 6 · answered Apr 20 '16 at 22:21

This isn't the right answer but showing some interesting things comparing to for loops, vectorization is fast does not care about sequential updating. a<-read.table(textConnection( "CustomerID TripDate 1 1/3/2013 1 1/4/2013 1 1/9/2013 2 2/1/2013 2 2/4/2013 3 1/2/2013 "), header=TRUE)

a <- a %>%
group_by(CustomerID,TripDate) # must in order

res <- rep(1, nrow(a))  #base @ 1
res[2:6] <-sapply(2:6, function(i)if(a$CustomerID[i]== a$CustomerID[i - 1]) {res[i] = res[i-1]+1} else {res[i]= res[i]})
a$TripeCounter <- res

Create counter with multiple variables

6 Answers6

Linked

Related