
My data.frame is structured as follows:

   Prod.order Operation           Timestamp
1:        126      B100 2022-08-15 10:30:00
2:        126      B100 2022-08-15 10:33:00
3:        126      B100 2022-08-15 10:35:00
4:        126      D200 2022-08-15 10:40:00
5:        126      D200 2022-08-15 10:45:00
6:        126      B100 2022-08-15 10:50:00
7:        126      B100 2022-08-15 10:52:00
8:        126      B100 2022-08-15 10:55:00

Prod.order is an id variable. For each consecutive run of a particular Operation, I want to find the earliest and latest Timestamp. In the above case, my desired output is:

   Prod.order Operation            earliest              latest
1:        126      B100 2022-08-15 10:30:00 2022-08-15 10:35:00
2:        126      D200 2022-08-15 10:40:00 2022-08-15 10:45:00
3:        126    B100_2 2022-08-15 10:50:00 2022-08-15 10:55:00

Note that because there were two runs of consecutive B100s (separated by a run of consecutive D200s), B100 shows up here twice, with the second pass denoted by _2.

My actual data.frame has a similar set-up, but for multiple values of Prod.order (see bottom of post). Given this, I need to (1) detect the different consecutive runs of Operations within each Prod.order, and (2) find the earliest and latest Timestamp per Operation per Prod.order.

Example data below: df_current is raw data, df_desired is desired output.

P.S. Solutions that can be piped are extra awesome xD

Thanks!

df_current <- data.frame(
    Prod.order = c(rep(123, 23), rep(124, 4), rep(125, 3), rep(126, 8)),
    Operation = c(
    rep('B100', 6), rep('D200', 6), rep('B100', 5), rep('D300', 6),
    rep('B100', 4),
    rep('B100', 1), rep('D200', 1), rep('D300', 1),
    rep('B100', 3), rep('D200', 2), rep('B100', 3)),
    Timestamp = as.POSIXct(c(
    "2022-08-15 10:00:00", "2022-08-15 10:01:00", "2022-08-15 10:02:00", "2022-08-15 10:03:00", "2022-08-15 10:04:00", "2022-08-15 10:05:00",
    "2022-08-15 10:10:00", "2022-08-15 10:11:00", "2022-08-15 10:12:00", "2022-08-15 10:13:00", "2022-08-15 10:14:00", "2022-08-15 10:15:00",
    "2022-08-15 10:20:00", "2022-08-15 10:21:00", "2022-08-15 10:23:00", "2022-08-15 10:24:00", "2022-08-15 10:25:00",
    "2022-08-15 10:30:00", "2022-08-15 10:31:00", "2022-08-15 10:32:00", "2022-08-15 10:33:00", "2022-08-15 10:34:00", "2022-08-15 10:35:00",
    "2022-08-15 10:10:00", "2022-08-15 10:12:00", "2022-08-15 10:14:00", "2022-08-15 10:15:00",
    "2022-08-15 10:20:00",
    "2022-08-15 10:30:00",
    "2022-08-15 10:40:00",
    "2022-08-15 10:30:00", "2022-08-15 10:33:00", "2022-08-15 10:35:00",
    "2022-08-15 10:40:00", "2022-08-15 10:45:00",
    "2022-08-15 10:50:00", "2022-08-15 10:52:00", "2022-08-15 10:55:00"),
    format= "%Y-%m-%d %H:%M:%S")
)

df_desired <- data.frame(
    Prod.order = c(rep(123,4), 124, rep(125,3), rep(126,3)),
    operation = c(
    'B100', 'D200', 'B100_2', 'D300',
    'B100',
    'B100', 'D200', 'D300',
    'B100', 'D200', 'B100_2'),
    earliest = as.POSIXct(c(
    "2022-08-15 10:00:00", "2022-08-15 10:10:00", "2022-08-15 10:20:00", "2022-08-15 10:30:00",
    "2022-08-15 10:10:00",
    "2022-08-15 10:20:00", "2022-08-15 10:30:00", "2022-08-15 10:40:00",
    "2022-08-15 10:30:00", "2022-08-15 10:40:00", "2022-08-15 10:50:00"),
    format= "%Y-%m-%d %H:%M:%S"),
    latest = as.POSIXct(c(
    "2022-08-15 10:05:00", "2022-08-15 10:15:00", "2022-08-15 10:25:00", "2022-08-15 10:35:00",
    "2022-08-15 10:15:00",
    "2022-08-15 10:20:00", "2022-08-15 10:30:00", "2022-08-15 10:40:00",
    "2022-08-15 10:35:00", "2022-08-15 10:45:00", "2022-08-15 10:55:00"),
    format= "%Y-%m-%d %H:%M:%S")
)
1 Answer


A data.table solution:

First step: detect groups

Use rleid() to assign a run id that increments each time the Operation changes within a Prod.order, so a later, repeated run of the same Operation gets a new id. Note that \(x) is just shorthand for function(x) (available since R 4.1).
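For intuition, here is what rleid() does on a small vector (a quick illustration only, not part of the solution):

data.table::rleid(c("B100", "B100", "D200", "B100", "B100"))
# [1] 1 1 2 3 3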

library(data.table)
setDT(df_current)

df_current[, group := rleid(Operation), by = Prod.order]
df_current <- split(df_current, by = "Prod.order") # unfortunately, slow

lapply(df_current, 
    \(x) x[, Operation := paste0(Operation, "_", rleid(group)), by = Operation])
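# note: := updates each split data.table by reference, so lapply() is used for
# its side effect and its result does not need to be reassigned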

df_current <- rbindlist(df_current)
df_current
    Prod.order Operation           Timestamp group
 1:        123    B100_1 2022-08-15 10:00:00     1
 2:        123    B100_1 2022-08-15 10:01:00     1
 3:        123    B100_1 2022-08-15 10:02:00     1
 4:        123    B100_1 2022-08-15 10:03:00     1
 5:        123    B100_1 2022-08-15 10:04:00     1

34:        126    D200_1 2022-08-15 10:40:00     2
35:        126    D200_1 2022-08-15 10:45:00     2
36:        126    B100_2 2022-08-15 10:50:00     3
37:        126    B100_2 2022-08-15 10:52:00     3
38:        126    B100_2 2022-08-15 10:55:00     3
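As an aside, the split()/lapply()/rbindlist() round trip can probably be avoided by doing the relabelling in a single grouped update right after creating group (a sketch of the same idea, not benchmarked):

df_current[, Operation := paste0(Operation, "_", rleid(group)), by = .(Prod.order, Operation)]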
Second step: calculate min and max

Grouping by Prod.order and Operation, find the minimum and maximum timestamps.

df_current[, .(earliest = min(Timestamp),
               latest = max(Timestamp)),
           by = c("Prod.order","Operation")]

# Optional, ordering
df_current[, .(earliest = min(Timestamp),
               latest = max(Timestamp)),
           by = c("Prod.order","Operation")][order(Prod.order)]

    Prod.order Operation            earliest              latest
 1:        123    B100_1 2022-08-15 10:00:00 2022-08-15 10:05:00
 2:        123    D200_1 2022-08-15 10:10:00 2022-08-15 10:15:00
 3:        123    B100_2 2022-08-15 10:20:00 2022-08-15 10:25:00
 4:        123    D300_1 2022-08-15 10:30:00 2022-08-15 10:35:00
 5:        124    B100_1 2022-08-15 10:10:00 2022-08-15 10:15:00
 6:        125    B100_1 2022-08-15 10:20:00 2022-08-15 10:20:00
 7:        125    D200_1 2022-08-15 10:30:00 2022-08-15 10:30:00
 8:        125    D300_1 2022-08-15 10:40:00 2022-08-15 10:40:00
 9:        126    B100_1 2022-08-15 10:30:00 2022-08-15 10:35:00
10:        126    D200_1 2022-08-15 10:40:00 2022-08-15 10:45:00
11:        126    B100_2 2022-08-15 10:50:00 2022-08-15 10:55:00
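Since you mentioned pipes: the same logic can also be written as a dplyr pipeline, starting again from the raw df_current (a sketch under the assumption that rows arrive ordered by Timestamp within each Prod.order, reusing data.table::rleid() for the run detection):

library(dplyr)

df_current |>
    group_by(Prod.order) |>
    mutate(run = data.table::rleid(Operation)) |>      # consecutive-run id within each order
    group_by(Prod.order, Operation) |>
    mutate(pass = data.table::rleid(run)) |>           # 1st, 2nd, ... pass of this Operation
    ungroup() |>
    mutate(Operation = paste0(Operation, "_", pass)) |>
    group_by(Prod.order, Operation) |>
    summarise(earliest = min(Timestamp),
              latest   = max(Timestamp),
              .groups  = "drop") |>
    arrange(Prod.order, earliest)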
  • Sorry, I now see my current dataset was incorrect. In the actual dataset it does not mention the "_2" after the B100 to indicate the second time the order was processed by B100. Hence there is a first step needed to mutate the strings in the Operation column to "B100_2" for those loggings that belong to the second pass. After that your solution will work perfectly, I think. Do you know a solution for that? – Roel Aug 30 '22 at 01:42
  • @Roel I believe the above solution works - I think your `df_desired` is wrong. For example, how can the earliest `Timestamp` for `Prod.order == 125` and `Operation == "B100"` be `10:20:00` when that `Timestamp` does not exist in `df_current` under those conditions? – diomedesdata Aug 30 '22 at 11:09
  • Yes! Your solution works perfectly. Thank you so much :D. Does the data need to be ordered for the first step to work? P.S. The df_desired is now corrected in the original question. – Roel Aug 31 '22 at 02:43
  • @Roel yes, the data needs to be ordered as in `df_current` for this to work, otherwise there wouldn't be a way to tell the different consecutive runs of each `Operation` inside a `Prod.order` apart. – diomedesdata Aug 31 '22 at 10:55