I've assumed you have read your CSV into a dataframe named df
and I've renamed your variables using snake case, since having variables with a space in the name can be inconvenient, leading many to use either snake case or camel case variable naming conventions.
Here is a base R solution:
mean(sapply(by(df$purchase_date, df$customer_id, diff), mean), na.rm=TRUE)
[1] 60.75
You may notice that we get 60.75
rather than 60 as you expected. This is because there are 31 days between customer 1's purchases (31 days in January until February 1), and similarly for customer 2's purchases -- there are not always 30 days in a month.
Explanation
by(df$purchase_date, df$customer_id, diff)
The by()
function applies another function to data by groupings. Here, we are applying diff()
to df$purchase_date
by the unique values of df$customer_id
. By itself, this would result in the following output:
df$customer_id: 1
Time difference of 31 days
-----------------------------------------------------------
df$customer_id: 2
Time differences in days
[1] 59 122
We then use
sapply(by(df$purchase_date, df$customer_id, diff), mean)
to apply mean()
to the elements of the previous result. This gives us each customer's average time to repurchase:
1 2 3 4
31.0 90.5 NaN NaN
(we see customers 3 and 4 never repurchased). Finally, we need to average these average repurchase times, which means we need to also deal with those NaN
values, so we use:
mean(sapply(by(df$purchase_date, df$customer_id, diff), mean), na.rm=TRUE)
which will average the previous results, ignoring missing values (which, in R include NaN values).