I have a dataset that I can reduce to three columns - CustomerID, EnterDate, ReturnDate. I would like to add a 3rd column which states whether or not, if a CustomerID appears more than once in the dataset, the second 'EnterDate' is within 30 days of the first 'ExitDate' (and the third is within 30 days of the second etc. if there are multiple entries for a single Customer ID).
So to turn a table like this:
CustomerID | EnterDate | ExitDate |
---|---|---|
1 | 14/09/2021 | 15/09/2021 |
1 | 03/10/2021 | 11/10/2021 |
2 | 03/10/2021 | 01/10/2021 |
2 | 17/10/2021 | 11/11/2021 |
3 | 03/10/2021 | 11/10/2021 |
3 | 30/12/2021 | 31/12/2021 |
4 | 03/10/2021 | 09/07/2022 |
In to this - an entry of '1' is entered in new column 'ResaleWithin30' if CustomerID matches and 'EnterDate' is within 30 days of previous 'ExitDate'.
CustomerID | EnterDate | ExitDate | ResaleWithin30 |
---|---|---|---|
1 | 14/09/2021 | 15/09/2021 | 0 |
1 | 03/10/2021 | 11/10/2021 | 1 |
2 | 03/10/2021 | 01/10/2021 | 0 |
2 | 17/10/2021 | 11/11/2021 | 1 |
3 | 03/10/2021 | 11/10/2021 | 0 |
3 | 30/12/2021 | 31/12/2021 | 0 |
4 | 03/10/2021 | 09/07/2022 | 0 |
The below code works for comparing just EnterDate to previous EnterDate but I'd like essentially EnterDate to compare to ExitDate. I assume I need to amend the mutate statement to apply to both EnterDate and ExitDate and then change the lag to compare EnterDate to ExitDate however I am getting in to various errors trying to get this completed so any amendment/help would be very much appreciated. Thank you!
library(dplyr)
df %>%
group_by(CustomerID) %>%
mutate(EnterDate = as.Date(EnterDate, tryFormats = "%d/%m/%Y"),
ResaleWithin30 = as.integer(EnterDate - lag(EnterDate) <= 30),
ResaleWithin30 = replace_na(ResaleWithin30, 0))