0

Was wondering how I would use R to calculate the below.

  1. Assuming a CSV with the following purchase data:

    | Customer ID  | Purchase Date | 
    | 1            | 01/01/2017    |
    | 2            | 01/01/2017    |
    | 3            | 01/01/2017    |
    | 4            | 01/01/2017    |
    | 1            | 02/01/2017    |
    | 2            | 03/01/2017    |
    | 2            | 07/01/2017    |
    
  2. I want to figure out the average time between repurchases by customer.

The math would be like the one below:

| Customer ID  | AVG repurchase | 
| 1            | 30 days        | = (02/01 - 01/01 / 1 order
| 2            | 90 days        | = ( (03/01 - 01/01) + (07 - 3/1) ) /2 orders
| 3            | n/a            |
| 4            | n/a            |
  1. The output would be the total average across customers -- so: 60 days = (30 avg for customer1 + 90 avg for customer2) / 2 customers.
Barbara
  • 1,118
  • 2
  • 11
  • 34
Dan F
  • 23
  • 3

2 Answers2

1

I've assumed you have read your CSV into a dataframe named df and I've renamed your variables using snake case, since having variables with a space in the name can be inconvenient, leading many to use either snake case or camel case variable naming conventions.

Here is a base R solution:

mean(sapply(by(df$purchase_date, df$customer_id, diff), mean), na.rm=TRUE)

[1] 60.75

You may notice that we get 60.75 rather than 60 as you expected. This is because there are 31 days between customer 1's purchases (31 days in January until February 1), and similarly for customer 2's purchases -- there are not always 30 days in a month.

Explanation

by(df$purchase_date, df$customer_id, diff)

The by() function applies another function to data by groupings. Here, we are applying diff() to df$purchase_date by the unique values of df$customer_id. By itself, this would result in the following output:

df$customer_id: 1
Time difference of 31 days
----------------------------------------------------------- 
df$customer_id: 2
Time differences in days
[1]  59 122

We then use

sapply(by(df$purchase_date, df$customer_id, diff), mean)

to apply mean() to the elements of the previous result. This gives us each customer's average time to repurchase:

   1    2    3    4 
31.0 90.5  NaN  NaN

(we see customers 3 and 4 never repurchased). Finally, we need to average these average repurchase times, which means we need to also deal with those NaN values, so we use:

mean(sapply(by(df$purchase_date, df$customer_id, diff), mean), na.rm=TRUE)

which will average the previous results, ignoring missing values (which, in R include NaN values).

duckmayr
  • 16,303
  • 3
  • 35
  • 53
  • This is great, thank you. For some reason I get an error: ``` Error in as.character.factor(x) : malformed factor ``` Code: ``` orders <- read.csv(file="/r_example.csv",head=TRUE,sep=",") by(orders$Timestamp, orders$personIdentifier, diff) #error is Error in as.character.factor(x) : malformed factor mean(sapply(by(orders_with_timestamp$Timestamp, orders_with_timestamp$personIdentifier, diff), mean), na.rm=TRUE) #gives warnings ``` – Dan F Nov 17 '17 at 20:35
  • Basically "Error in as.character.factor(x) : malformed factor" on the by( ... ) function call. Tried to update my comment above but lost on the formatting... ha – Dan F Nov 17 '17 at 20:42
  • I believe this error can only be reproduced, and therefore fixed/addressed, by having your data. If you add the output of `dput(orders)` to your question, it could help. For future reference, you will typically get better answers if you do that initially (which is why Barbara left a comment to that effect). – duckmayr Nov 17 '17 at 20:58
  • Thank you. Sorry, I'm still new to R - I learned quite a bit from this thread. I got it everything working great by converting the field to a date field in the DF. df$purchase_date <- as.Date(df$purchase_date , "%m/%d/%y"). Then everything worked. – Dan F Nov 17 '17 at 21:12
0

Here's another solution with dplyr + lubridate:

library(dplyr)
library(lubridate)

df %>%
  mutate(Purchase_Date = mdy(Purchase_Date)) %>%
  group_by(Customer_ID) %>%
  summarize(AVG_Repurchase = sum(difftime(Purchase_Date, 
                                          lag(Purchase_Date), units = "days"), 
                                 na.rm=TRUE)/(n()-1))

or with data.table:

library(data.table)

setDT(df)[, Purchase_Date := mdy(Purchase_Date)]

df[, .(AVG_Repurchase = sum(difftime(Purchase_Date, 
                                     shift(Purchase_Date), units = "days"), 
                            na.rm=TRUE)/(.N-1)), by = "Customer_ID"]

Result:

# A tibble: 4 x 2
  Customer_ID AVG_Repurchase
        <dbl>         <time>
1           1      31.0 days
2           2      90.5 days
3           3       NaN days
4           4       NaN days

   Customer_ID AVG_Repurchase
1:           1      31.0 days
2:           2      90.5 days
3:           3       NaN days
4:           4       NaN days

Note:

I first converted Purchase_Date to mmddyyyy format, then group_by Customer_ID. Final for each Customer_ID, I calculated the mean of the days difference between Purchase_Date and it's lag.

Data:

df = structure(list(Customer_ID = c(1, 2, 3, 4, 1, 2, 2), Purchase_Date = c(" 01/01/2017", 
" 01/01/2017", " 01/01/2017", " 01/01/2017", " 02/01/2017", " 03/01/2017", 
" 07/01/2017")), .Names = c("Customer_ID", "Purchase_Date"), class = "data.frame", row.names = c(NA, 
-7L))
acylam
  • 18,231
  • 5
  • 36
  • 45