0

The situation: I have a dataset containing transactions (transData). Each transactions has its line, which has relevant columns: transactionID, customerID, Date and moneySpend.

Simplified example:

1; 101; 1/1/18; 42
2; 101; 1/1/18; 13
3; 102; 1/1/18; 32
4; 103; 1/1/18; 56
5; 103; 1/1/18; 85
6; 103; 2/1/18; 8
7; 101; 2/1/18; 23
8; 103; 2/1/18; 14
9; 103; 2/1/18; 35
10; 104; 2/1/18; 48

What I need: A single customer can buy multiple items per day, however each item has its own line in the transactions dataset. However, I need these transactions combined into a single one, where the moneySpend is the sum of the individual items.

Simplified example:

1; 101; 1/1/18; 55
2; 102; 1/1/18; 32
3; 103; 1/1/18; 141
4; 103; 2/1/18; 77
5; 101; 2/1/18; 23
6; 104; 2/1/18; 48

(Note: the transactionID is not important, aslong as it is unique.)

What I have done: With the ddply from the plyr package I create a table that sorts out the unqiue combination of customerId and day:

newTable <- ddply(transData, .(transData$customerID, transData$Date), nrow)

Next I sum up the transaction in a for loop:

for (i in 1:dim(newTable)[1]){ 
  trans = which(transData$customerID==newTable[i,1] & transData$Date==newTable[i,2])
  totalSpend[i]=sum(transData[trans,32:35])
}

The problem: This is way too slow for the amount of transactions needed to be processed.

Is there a way to do this (way) more efficiently?

  • 1
    In order to help us help you, please provide a reproducible example. And mesure your time using `microbenchmark`package. Also you should try to vectorize your for loop – Emmanuel-Lin Feb 09 '18 at 13:10
  • 1
    Could you use `dput` to provide reproducible example – Emmanuel-Lin Feb 09 '18 at 13:12
  • So you want to sum your moneyspend by custumer_id? Check this https://stackoverflow.com/questions/1660124/how-to-sum-a-variable-by-group – Koot6133 Feb 09 '18 at 13:13

2 Answers2

0

In data.table, simply:

transData[, newVar := sum(moneySpend), by = c("customerID", "Date")]
C-x C-c
  • 1,261
  • 8
  • 20
0

I found a solution based on some comments here using the dplyr package.

transactions = transData %>% 
   group_by(customerID,Date) %>% 
   summarise(moneySpend = sum(moneySpend))

Thanks for thinking along.