
I'm trying to compute two variables from a set of basic decision rules. However, the computation for most rows depends on values in other rows, so it has to be performed sequentially. I can't figure out how to do this efficiently (ideally in the tidyverse). Thanks in advance!

Data:

Each row of the dataset represents a submitted work report. Each report is nested in a group, with k indicating the index of the report within the group and n the number of reports in the group. time is the timestamp of the submission, length is the approximate length of the work in minutes, and wait indicates whether there was a wait before starting the work (wait == 1) or whether the work was started immediately after finishing the previous one (wait == 0).

+-------+---+---+---------------------+--------+------+
| group | k | n |        time         | length | wait |
+-------+---+---+---------------------+--------+------+
| A     | 1 | 5 | 2017-10-17 12:43:29 |   17.5 |    1 |
| A     | 2 | 5 | 2017-10-17 12:44:52 |   45.5 |    0 |
| A     | 3 | 5 | 2017-10-17 12:45:58 |   17.5 |    1 |
| A     | 4 | 5 | 2017-10-17 13:45:31 |      5 |    1 |
| A     | 5 | 5 | 2017-10-17 13:46:48 |   17.5 |    0 |
| B     | 1 | 3 | 2017-11-14 12:07:18 |   45.5 |    1 |
| B     | 2 | 3 | 2017-11-14 12:14:43 |   45.5 |    1 |
| B     | 3 | 3 | 2017-11-14 12:17:45 |   45.5 |    1 |
+-------+---+---+---------------------+--------+------+

Edit: importable data example

structure(list(group = c("A", "A", "A", "A", "A", "B", "B", "B"
), k = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L), n = c(5L, 5L, 5L, 5L, 
5L, 3L, 3L, 3L), time = structure(c(1508258609.388, 1508258692.614, 
1508258758.234, 1508262331.385, 1508262408.434, 1510679238.849, 
1510679683.961, 1510679865.964), class = c("POSIXct", "POSIXt"
), tzone = "America/New_York"), length = c(17.5, 45.5, 17.5, 5, 17.5, 45.5, 45.5, 
45.5), wait = c(1, 0, 1, 1, 0, 1, 1, 1)), row.names = c(NA, -8L
), class = "data.frame")

I'm trying to compute two new values: start and end, the start and end time of each submitted work. There are several thousand rows, so I'm looking for an efficient solution.

Decision logic:

If a report is the last in its group (k == n)

end = time

start = end - length

If a report is not last, and followed by one that involved no wait (k < n & lead(wait) == 0)

end = lead(start) - 1

start = end - length

If a report is not last, and followed by one that did involve a wait (k < n & lead(wait) == 1)

end = lead(start) - 0.5 * length

start = end - length

As such, computation would start at the last report in each group, and then loop backwards over the rows up to the first row.
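For reference, here is the naive sequential version of the rules — the row-by-row backward loop I'd like to avoid on thousands of rows, but which is useful for verifying results. It reconstructs the dput() sample above as `df` so it runs standalone:

```r
# The dput() sample from above, rebuilt so this snippet is self-contained
df <- data.frame(
  group  = rep(c("A", "B"), c(5, 3)),
  k      = c(1:5, 1:3),
  n      = rep(c(5L, 3L), c(5, 3)),
  time   = as.POSIXct(c(1508258609, 1508258692, 1508258758, 1508262331,
                        1508262408, 1510679238, 1510679683, 1510679865),
                      origin = "1970-01-01", tz = "America/New_York"),
  length = c(17.5, 45.5, 17.5, 5, 17.5, 45.5, 45.5, 45.5),
  wait   = c(1, 0, 1, 1, 0, 1, 1, 1))

df <- df[order(df$group, df$k), ]
df$end   <- df$time  # placeholders of the right class; overwritten below
df$start <- df$time
# Loop backwards: lengths are in minutes, POSIXct arithmetic is in seconds
for (i in rev(seq_len(nrow(df)))) {
  if (df$k[i] == df$n[i]) {                 # last report in its group
    df$end[i] <- df$time[i]
  } else if (df$wait[i + 1] == 0) {         # next report had no wait
    df$end[i] <- df$start[i + 1] - 1 * 60
  } else {                                  # next report had a wait
    df$end[i] <- df$start[i + 1] - 0.5 * df$length[i] * 60
  }
  df$start[i] <- df$end[i] - df$length[i] * 60
}
```

(When k < n, row i + 1 is always in the same group, since rows are ordered by group and k.)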

Desired output:

+-------+---+---+---------------------+--------+------+----------+----------+
| group | k | n |        time         | length | wait |   end    |  start   |
+-------+---+---+---------------------+--------+------+----------+----------+
| A     | 1 | 5 | 2017-10-17 12:43:29 |   17.5 |    1 | 11:47:48 | 11:30:18 |
| A     | 2 | 5 | 2017-10-17 12:44:52 |   45.5 |    0 | 12:34:18 | 11:48:48 |
| A     | 3 | 5 | 2017-10-17 12:45:58 |   17.5 |    1 | 13:14:33 | 12:57:03 |
| A     | 4 | 5 | 2017-10-17 13:45:31 |      5 |    1 | 13:28:18 | 13:23:18 |
| A     | 5 | 5 | 2017-10-17 13:46:48 |   17.5 |    0 | 13:46:48 | 13:29:18 |
| B     | 1 | 3 | 2017-11-14 12:07:18 |   45.5 |    1 | 10:01:15 | 9:15:45  |
| B     | 2 | 3 | 2017-11-14 12:14:43 |   45.5 |    1 | 11:09:30 | 10:24:00 |
| B     | 3 | 3 | 2017-11-14 12:17:45 |   45.5 |    1 | 12:17:45 | 11:32:15 |
+-------+---+---+---------------------+--------+------+----------+----------+
mnml
    With all the table layout `+-----+---+`... your data is easy to see, but very difficult to import into R to experiment on. Would you mind editing your input data to make it friendlier to copy/paste into R? Adding `dput(your_sample_data)` would be perfect... – Gregor Thomas Sep 12 '18 at 18:57
  • 1
    I have asked similar questions before, and **data.table** package is the most efficient way to go. Check out my post and Arun's answer to get some ideas [Identify Value Changes and Summarize Values](https://stackoverflow.com/questions/39212530/identify-a-value-changes-date-and-summarize-the-data-with-sum-and-diff-in-r) – lemonC Sep 12 '18 at 19:22
  • 1
    @ Gregor: added an importable data example – mnml Sep 12 '18 at 19:24
  • Thanks for including sample data. When I load it into my system, the times come out differently (eg first row 09:43:29 instead of 12:43:29), presumably because one or both of our respective systems is imposing our local time zone on the data. Does anyone know how to share and/or receive POSIXct this way but without tz shift? – Jon Spring Sep 12 '18 at 20:22
  • It seems like the `wait` values are reversed from your input to your output. Is that intended? – Jon Spring Sep 12 '18 at 20:33
  • @JonSpring I added timezone info to the sample data. It should come out as intended on your end now. I also corrected the `wait` values, which were reversed by mistake. – mnml Sep 12 '18 at 20:55
  • Thanks, those work for me. Am I interpreting the logic correctly here? If the next one had a wait, assume a 50% cushion (in terms of current rpt) between the current job end and the next start, otherwise assume a 1 minute cushion? – Jon Spring Sep 12 '18 at 22:34
  • @JonSpring that's exactly right. Thanks so much for looking into this! – mnml Sep 12 '18 at 22:47
  • The desired output seems inconsistent with the rules as described. Looking at line B.2, I'd think we'd apply the third rule. Shouldn't it end 45.5 min * 0.5 = 22.75 min = 22' 45" before B.3, which would be 11:09:30, not 11:19:30? Aside from that, it looks like there's a fast way to do it by working from end of the last report in each group, and just track the cumulative duration back in time, where each line adds either length + 1 or length * 1.5. – Jon Spring Sep 13 '18 at 04:15
  • @JonSpring fyi, i corrected the desired output -- you were right that there was an error – mnml Sep 13 '18 at 17:33

1 Answer


Here's a tidyverse approach without loops.

After drawing a picture, I realized that each report's start just pushes back in time from the end of the group, where each report adds either no cushion (if it's last), a 1-minute cushion (if the following report had no wait), or half the current report's length (if the following report waited).

So if we reverse the order, we just need to track how much delay to add for each row, add that to each length, then take the cumulative sum of those. That's how much time to subtract from the group's end time (which in this order is the time in the first row of each group).

library(dplyr)
library(lubridate)  # for seconds()

output <- df %>%
  arrange(group, -k) %>%
  group_by(group) %>%
  mutate(wait = as.logical(wait)) %>%
  mutate(delay = case_when(k == n      ~ 0,  
                           # is *next* rpt no wait? (use lag since order reversed)
                           lag(!wait)  ~ 1,
                           TRUE        ~ 0.5 * length),
         pushback_alone = length + delay,
         pushback_cumul = cumsum(pushback_alone),
         # So the last shall be first, and the first last...
         start = first(time) - seconds(pushback_cumul*60),              
         end = start + seconds(length*60)
  ) %>% ungroup() %>%  # EDIT: to make the table ungrouped like it started
  arrange(group, k)

Result

output
# A tibble: 8 x 11
  group     k     n time                length wait  delay pushback_alone pushback_cumul start               end                
  <chr> <int> <int> <dttm>               <dbl> <lgl> <dbl>          <dbl>          <dbl> <dttm>              <dttm>             
1 A         1     5 2017-10-17 12:43:29   17.5 TRUE   1              18.5          136.  2017-10-17 11:30:18 2017-10-17 11:47:48
2 A         2     5 2017-10-17 12:44:52   45.5 FALSE 22.8            68.2          118   2017-10-17 11:48:48 2017-10-17 12:34:18
3 A         3     5 2017-10-17 12:45:58   17.5 TRUE   8.75           26.2           49.8 2017-10-17 12:57:03 2017-10-17 13:14:33
4 A         4     5 2017-10-17 13:45:31    5   TRUE   1               6             23.5 2017-10-17 13:23:18 2017-10-17 13:28:18
5 A         5     5 2017-10-17 13:46:48   17.5 FALSE  0              17.5           17.5 2017-10-17 13:29:18 2017-10-17 13:46:48
6 B         1     3 2017-11-14 12:07:18   45.5 TRUE  22.8            68.2          182   2017-11-14 09:15:45 2017-11-14 10:01:15
7 B         2     3 2017-11-14 12:14:43   45.5 TRUE  22.8            68.2          114.  2017-11-14 10:24:00 2017-11-14 11:09:30
8 B         3     3 2017-11-14 12:17:45   45.5 TRUE   0              45.5           45.5 2017-11-14 11:32:15 2017-11-14 12:17:45
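The comments mention data.table; for completeness, the same cumulative-sum idea translates there too. A sketch (reconstructing the question's dput() sample as `df` so it runs standalone):

```r
library(data.table)

# The question's dput() sample, rebuilt so this runs standalone
df <- data.frame(
  group  = rep(c("A", "B"), c(5, 3)),
  k      = c(1:5, 1:3),
  n      = rep(c(5L, 3L), c(5, 3)),
  time   = as.POSIXct(c(1508258609, 1508258692, 1508258758, 1508262331,
                        1508262408, 1510679238, 1510679683, 1510679865),
                      origin = "1970-01-01", tz = "America/New_York"),
  length = c(17.5, 45.5, 17.5, 5, 17.5, 45.5, 45.5, 45.5),
  wait   = c(1, 0, 1, 1, 0, 1, 1, 1))

dt <- as.data.table(df)
setorder(dt, group, -k)                       # reverse order, as above
dt[, delay := fifelse(k == n, 0,              # last report: no cushion
       fifelse(shift(wait) == 0, 1,           # next report had no wait
               0.5 * length)),                # next report waited
   by = group]
# Push each start back from the group's end by the cumulative delay
dt[, start := first(time) - cumsum(length + delay) * 60, by = group]
dt[, end := start + length * 60]
setorder(dt, group, k)
```

Note that `shift(wait)` plays the role of `lag(wait)` here because the rows are reversed; its leading NA per group never matters since the `k == n` branch catches that row first.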
Jon Spring