I have a data like this:
structure(list(step_origin = c(4897L, 3105L, 129L, 2689L, 2945L,
161L), step_destination = c(3105L, 1057L, 2689L, 2945L, 3201L,
673L)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x000001a52ad81ef0>)
in human friendly form it looks like:
step_origin step_destination
1: 4897 3105
2: 3105 1057
3: 129 2689
4: 2689 2945
5: 2945 3201
6: 161 673
each row represents a step in some process, first column indicates the origin of the step and the second column indicates where the step ends.
If a step_destination
in one row is the same as step_origin
of another row, then these two steps are related.
I want to find all related steps and order them first to last (as a first step is the one that originates at a number which is not recorded as destination on any other row, similarly, the sequence of steps end with a destination which is not at the same time also an origin).
I can imagine two desirable outcomes I would like to obtain:
- a list, where each element of list stores vector of the related steps.
- a data table where each row stores the related steps and number of columns in the data table corresponds to the length of the longest sequence of steps.
The data table in this case would look like:
sequence_id step_1 step_2 step_3 step_4
1: 1 129 2689 2945 3201
2: 2 161 673 NA NA
3: 3 4897 3105 1057 NA
Now I would like a way that dynamically identifies how many columns the resulting table should have, but in practice I know that there will be no more than 12 consecutive steps.
EDIT:
the original question has been already answered, however my real scenario turns out to be a bit more complicated than i originally anticipated.
The process described above can actually move from one origin to two different destinations.
An example of data:
structure(list(step_origin = c(3105, 2689, 2689, 1610), step_destination = c(2689,
2945, 3201, 6730), time = c("2019-03-27 13:24:07", "2019-03-27 20:46:58",
"2019-03-28 16:02:57", "2019-03-28 16:12:44")), row.names = c(NA,
-4L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x000001a52ad81ef0>)
Which looks like:
step_origin step_destination time
1: 3105 2689 2019-03-27 13:24:07
2: 2689 2945 2019-03-27 20:46:58
3: 2689 3201 2019-03-28 16:02:57
4: 1610 6730 2019-03-28 16:12:44
Which basically means that from 2689
the process splits to 2945
and 3201
.
Note that one destination is always reached from exactly one origin, but one origin can have multiple destinations.
I can get to:
sequence_id step_1 step_2 step_3
1: 1 3105 2689 2945
2: 2 2689 3201 NA
3: 3 1610 6730 NA
Using the approaches already proposed, however, in this case, I would like to have
sequence_id step_1 step_2 step_3
1: 1 3105 2689 2945
2: 2 3105 2689 3201
3: 3 1610 6730 NA
Which would indicate that destinations 2945 and 3201 were reached from start at 3105.