New variable by date range in R

Question

I am trying to create a new variable that is based on customer_id and dates. The table is a log of all customer contact, so there will be duplicate customer IDs. What I want to do it create a new variable that has a sequential count by using dates of contact within x days for each customer. All first contact from a customer will be = 1 and if there is a gap greater than x days since last contact, then that contact will be 2 and so on and so forth. I am trying to create the "Journey" variable.

Thanks for any guidance.

enter image description here

here is the code:

structure(list(Customer = structure(c(1L, 1L, 2L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"), 
    Start_dt = c("2018-04-30 13:47:13", "2018-05-03 09:22:25", 
    "2018-04-22 10:45:33", "2018-04-20 09:55:51", "2018-04-21 14:20:33", 
    "2018-05-01 15:27:43", "2018-03-28 11:25:45", "2018-04-28 10:30:35", 
    "2018-05-17 11:08:51", "2018-06-02 10:38:38"), End_dt = c("2018-04-30 14:22:15", 
    "2018-05-03 10:05:32", "2018-04-22 11:00:35", "2018-04-20 09:57:45", 
    "2018-04-21 14:27:14", "2018-05-01 16:03:25", "2018-03-28 11:35:54", 
    "2018-04-28 11:02:17", "2018-05-17 12:32:18", "2018-06-02 11:08:29"
    ), Journey = c(1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 3L, 4L)), class = "data.frame", row.names = c(NA, 
-10L))

So what did you actually try yourself? Take a look here [How to ask on SO](https://stackoverflow.com/help/how-to-ask) — Pinkie Swirl, Aug 20 '18 at 00:32
Please don't post screenshots of your data. You should review how to provide a [minimal reproducible example/attempt](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to provide (1) sample data in a copy&paste-able format (e.g. use `dput`), (2) your code attempt, and (3) your expected output. — Maurits Evers, Aug 20 '18 at 00:36

Artem · Answer 1 · 2018-09-19T19:14:18.217

Please see the algorithm below which is converting character vector into Date object, then it splits data.frame by factor column. Inside lapply function the algorithm is checking the conditions for Journey identification using zlag function. In the end it is using do.call function to concatenate data frames.

df <- structure(list(Customer = structure(c(1L, 1L, 2L, 3L, 3L, 3L, 
4L, 4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"), 
    Start_dt = structure(c(17651, 17654, 17643, 17641, 17642, 
    17652, 17618, 17649, 17668, 17684), class = "Date"), End_dt = structure(c(17651, 
    17654, 17643, 17641, 17642, 17652, 17618, 17649, 17668, 17684
    ), class = "Date")), row.names = c(NA, -10L), class = "data.frame")
library(lubridate, TSA)
df$Start_dt <- as_date(df$Start_dt)
df$End_dt <- as_date(df$End_dt)

x <- 10 # 10 days

y <- lapply(
  X = split(df, df$Customer), 
  FUN = function(dfx) {
    dfx$lagged <- as_date(zlag(dfx$Start_dt))
    dfx$dt <- dfx$Start_dt - dfx$lagged
    dfx$dt <- ifelse(dfx$dt < x, 0, 1)
    dfx$dt[1] <- 1
    dfx$Journey <- cumsum(dfx$dt)
    dfx[, -c(5:6)]
})
z <- do.call(rbind, y)
rownames(z) <- NULL
z

Output:

   Customer   Start_dt     End_dt Journey
1         A 2018-04-30 2018-04-30       1
2         A 2018-05-03 2018-05-03       1
3         B 2018-04-22 2018-04-22       1
4         C 2018-04-20 2018-04-20       1
5         C 2018-04-21 2018-04-21       1
6         C 2018-05-01 2018-05-01       2
7         D 2018-03-28 2018-03-28       1
8         D 2018-04-28 2018-04-28       2
9         D 2018-05-17 2018-05-17       3
10        D 2018-06-02 2018-06-02       4

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

For the sake of completeness, here is a more streamlined data.table version of Artem's cumsum() approach:

library(data.table)
library(lubridate)
x <- 9
setDT(df)[, journey := 1 + cumsum(shift(End_dt, fill = End_dt[1]) + days(x) < Start_dt), 
          by = Customer]

df

    Customer            Start_dt              End_dt journey
 1:        A 2018-04-30 13:47:13 2018-04-30 14:22:15       1
 2:        A 2018-05-03 09:22:25 2018-05-03 10:05:32       1
 3:        B 2018-04-22 10:45:33 2018-04-22 11:00:35       1
 4:        C 2018-04-20 09:55:51 2018-04-20 09:57:45       1
 5:        C 2018-04-21 14:20:33 2018-04-21 14:27:14       1
 6:        C 2018-05-01 15:27:43 2018-05-01 16:03:25       2
 7:        D 2018-03-28 11:25:45 2018-03-28 11:35:54       1
 8:        D 2018-04-28 10:30:35 2018-04-28 11:02:17       2
 9:        D 2018-05-17 11:08:51 2018-05-17 12:32:18       3
10:        D 2018-06-02 10:38:38 2018-06-02 11:08:29       4

Data

Initially, the OP has only posted a screenshot of his data. I have tried to convert the image with help of an online OCR converter which worked pretty well for both datetime columns:

library(magrittr)
df <- data.frame(
  Customer = rep(LETTERS[1:4], c(2, 1, 3, 4)),
  readr::read_csv("Start_dt
2018—04—30 13:47:13
2018—05-03 09:22:25
2018—04—22 10:45:33
2018—04—20 09:55:51
2018—04—21 14:20:33
2018—05—01 15:27:43
2018—03—28 11:25:45
2018—04-28 10:30:35
2018—05—17 11:08:51
2018—06-02 10:38:38
") %>% 
    dplyr::mutate_all(lubridate::ymd_hms),
  readr::read_csv("End_dt
2018—04—30 14:22:15
2018—05—03 10:05:32
2018—04—22 11:00:35
2018—04-20 09:57:45
2018—04—21 14:27:14
2018—05—01 16:03:25
2018—03—28 11:35:54
2018—04—28 11:02:17
2018—05—17 12:32:18
2018—06—02 11:08:29
") %>% 
    dplyr::mutate_all(lubridate::ymd_hms)
)

it's really cool what you've done with `data.table`. I'll certainly have a look. — Artem, Sep 19 '18 at 21:37

New variable by date range in R

2 Answers2

Data