I'm trying to compute two variables on the basis of a set of basic decision rules. However, the computation for most rows depends on values in other rows, which requires this computation to be performed sequentially. I can't figure out how to do this efficiently (ideally in the tidyverse). Thanks in advance!
Data:
Each row of the dataset represents a submitted work report. Each report is nested in a group
, with k
indicating the index of the report in the group, and n
representing the number of reports in the group. time
is the timestamp of the submission, length
is the approximate length of the work, in minutes, and wait
represents whether there was a wait before starting the work (wait == 1
) or whether the work was started immediately after finishing the previous one (wait == 0
)
+-------+---+---+---------------------+--------+------+
| group | k | n | time | length | wait |
+-------+---+---+---------------------+--------+------+
| A | 1 | 5 | 2017-10-17 12:43:29 | 17.5 | 1 |
| A | 2 | 5 | 2017-10-17 12:44:52 | 45.5 | 0 |
| A | 3 | 5 | 2017-10-17 12:45:58 | 17.5 | 1 |
| A | 4 | 5 | 2017-10-17 13:45:31 | 5 | 1 |
| A | 5 | 5 | 2017-10-17 13:46:48 | 17.5 | 0 |
| B | 1 | 3 | 2017-11-14 12:07:18 | 45.5 | 1 |
| B | 2 | 3 | 2017-11-14 12:14:43 | 45.5 | 1 |
| B | 3 | 3 | 2017-11-14 12:17:45 | 45.5 | 1 |
+-------+---+---+---------------------+--------+------+
Edit: importable data example
structure(list(group = c("A", "A", "A", "A", "A", "B", "B", "B"
), k = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L), n = c(5L, 5L, 5L, 5L,
5L, 3L, 3L, 3L), time = structure(c(1508258609.388, 1508258692.614,
1508258758.234, 1508262331.385, 1508262408.434, 1510679238.849,
1510679683.961, 1510679865.964), class = c("POSIXct", "POSIXt"
), tzone = "America/New_York"), length = c(17.5, 45.5, 17.5, 5, 17.5, 45.5, 45.5,
45.5), wait = c(1, 0, 1, 1, 0, 1, 1, 1)), row.names = c(NA, -8L
), class = "data.frame")
I'm trying to compute two new values: start
and end
, the start and end time of each submitted work. There are several thousand rows, so I'm looking for an efficient solution.
Decision logic:
If a report is the last in its group (k == n
)
end = time
start = end - length
If a report is not last, and followed by one that involved no wait (k < n & lead(wait) == 0
)
end = lead(start) - 1
start = end - length
If a report is not last, and followed by one that did involve a wait (k < n & lead(wait) == 1
)
end = lead(start) - 0.5 * length
start = end - length
As such, computation would start at the last report in each group, and then loop backwards over the rows up to the first row.
Desired output:
+-------+---+---+---------------------+--------+------+----------+----------+
| group | k | n | time | length | wait | end | start |
+-------+---+---+---------------------+--------+------+----------+----------+
| A | 1 | 5 | 2017-10-17 12:43:29 | 17.5 | 1 | 11:47:48 | 11:30:18 |
| A | 2 | 5 | 2017-10-17 12:44:52 | 45.5 | 0 | 12:34:18 | 11:48:48 |
| A | 3 | 5 | 2017-10-17 12:45:58 | 17.5 | 1 | 13:14:33 | 12:57:03 |
| A | 4 | 5 | 2017-10-17 13:45:31 | 5 | 1 | 13:28:18 | 13:23:18 |
| A | 5 | 5 | 2017-10-17 13:46:48 | 17.5 | 0 | 13:46:48 | 13:29:18 |
| B | 1 | 3 | 2017-11-14 12:07:18 | 45.5 | 1 | 10:01:15 | 9:15:45 |
| B | 2 | 3 | 2017-11-14 12:14:43 | 45.5 | 1 | 11:09:30 | 10:24:00 |
| B | 3 | 3 | 2017-11-14 12:17:45 | 45.5 | 1 | 12:17:45 | 11:32:15 |
+-------+---+---+---------------------+--------+------+----------+----------+