0

I have a large sales dataset (> 1 M rows) that contains 2 columns:

  • Client_ID
  • Purchase_Date

Client ID often repeats.

For each row, I would like to calculate "How many times has the client purchased in the last 30 days."

  • Client_ID
  • Purchase_Date
  • Freq_Last_30_Days

Currently, I loop through each row and do a nrow() count using the following criteria:

data$Client_ID == data$Client_ID[i]
data$Purchase_Date < data$Purchase_Date[i]
data$Purchase_Date >= data$Purchase_Date[i] - 30

Given the database size, this is taking over hours to do. Is there a more efficient way of finding the frequency without using a for loop?

1 Answers1

0

You could use sapply instead of a for loop, but probably is still slow (but worth a try). dplyr's filter will be a little faster than base R's [] method as well.

For example,

dfr <- data.frame(Client_ID = sample(letters[1:15], 100, replace=TRUE), 
              Date=seq.Date(as.Date("2016-11-1"), length=100, by="1 day"))

row_fun <- function(i){
  subs <- dfr[dfr$Date > dfr$Date[i] - 30 & dfr$Date < dfr$Date[i],]
  sum(subs$Client_ID == dfr$Client_ID[i])
}
dfr$Freq_Last_30_Days <- sapply(1:nrow(dfr), row_fun)
Remko Duursma
  • 2,741
  • 17
  • 24