2

I'd like to index customer transactions in an R dataframe so that I can easily identify, say, the third transaction that a particular customer has made. For example, if I have the following data frame (ordered by customer and transaction date):

transactions = data.frame(CUST.ID = c(1, 1, 2, 2, 2, 2, 3, 3, 3), 
DATE = as.Date(c("2009-07-02", "2013-08-15", "2010-01-02", "2004-03-05", 
"2006-02-03", "2007-01-01", "2004-03-05", "2006-02-03", "2007-01-01")),
AMOUNT = c(5, 9, 21, 34, 76, 1, 100, 23, 10))



> transactions
  CUST.ID       DATE AMOUNT
1       1 2009-07-02      5
2       1 2013-08-15      9
3       2 2010-01-02     21
4       2 2004-03-05     34
5       2 2006-02-03     76
6       2 2007-01-01      1
7       3 2004-03-05    100
8       3 2006-02-03     23
9       3 2007-01-01     10

I can clearly see that customer 1 has made 2 transactions, customer 2 has made 4, etc.

What I would like is to index these transactions by customer, creating a new column in my dataframe. The following code achieves what I want:

transactions$COUNTER = 1
transactions$CUSTOMER.TRANS.NO = unlist(aggregate(COUNTER ~ CUST.ID, 
data = transactions, 
function(x) {rank(x, ties.method = "first")})[, 2])
transactions$COUNTER = NULL


> transactions
  CUST.ID       DATE AMOUNT CUSTOMER.TRANS.NO
1       1 2009-07-02      5                 1
2       1 2013-08-15      9                 2
3       2 2010-01-02     21                 1
4       2 2004-03-05     34                 2
5       2 2006-02-03     76                 3
6       2 2007-01-01      1                 4
7       3 2004-03-05    100                 1
8       3 2006-02-03     23                 2
9       3 2007-01-01     10                 3

Now the first transaction for each customer is labelled 1, the second 2, etc.

So I've got what I want but it's such a horrible piece of code, creating a list and separating, it's just so ugly. Is anyone with more experience than me able to come up with a better solution?

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485

2 Answers2

2

Because you've taken the effort to post the sample code you tried (making your question a better Stack Overflow question than the duplicate I've linked to), I'll summarize the options here:

ave

within(transactions, { Trans.No <- ave(CUST.ID, CUST.ID, FUN = seq_along) })

getanID

library(splitstackshape)
getanID(transactions, "CUST.ID")

rle

## Depends on your data being sorted
transactions$Trans.No <- sequence(rle(transactions$CUST.ID)$lengths)

data.table

library(data.table)
DT <- data.table(transactions)
DT[, .id := sequence(.N), by = "CUST.ID"]
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • Thank you for your help, and sorry to post a duplicate question. I'd like to use the ave function but can't find a way to group by CUST.ID and DATE. Clearly if the dataframe is sorted it will work but now I'm getting greedy and would like to index unsorted data. – user2733680 Sep 02 '13 at 10:58
  • @user2733680, the `ave` function should work on unsorted data. It's `rle` that won't work because it is designed to determine *run lengths*. – A5C1D2H2I1M1N2O1R2T1 Sep 02 '13 at 11:33
  • @user2733680, and there's no problem in posting a duplicated question. It just gets marked as such and helps serve as a signpost for others looking to solve similar problems. Just try not to do it too often :) – A5C1D2H2I1M1N2O1R2T1 Sep 02 '13 at 11:39
0
library(plyr)
 ddply(transactions,.(CUST.ID),transform,CUSTOMER.TRANS.NO=seq(1,length(CUST.ID),1))
  CUST.ID       DATE AMOUNT CUSTOMER.TRANS.NO
1       1 2009-07-02      5                 1
2       1 2013-08-15      9                 2
3       2 2010-01-02     21                 1
4       2 2004-03-05     34                 2
5       2 2006-02-03     76                 3
6       2 2007-01-01      1                 4
7       3 2004-03-05    100                 1
8       3 2006-02-03     23                 2
9       3 2007-01-01     10                 3
Metrics
  • 15,172
  • 7
  • 54
  • 83