How to match linked steps indicated by two columns of a data.table

Question

I have a data like this:

structure(list(step_origin = c(4897L, 3105L, 129L, 2689L, 2945L, 
161L), step_destination = c(3105L, 1057L, 2689L, 2945L, 3201L, 
673L)), row.names = c(NA, -6L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x000001a52ad81ef0>)

in human friendly form it looks like:

   step_origin step_destination
1:        4897             3105
2:        3105             1057
3:         129             2689
4:        2689             2945
5:        2945             3201
6:         161              673

each row represents a step in some process, first column indicates the origin of the step and the second column indicates where the step ends.

If a step_destination in one row is the same as step_origin of another row, then these two steps are related.

I want to find all related steps and order them first to last (as a first step is the one that originates at a number which is not recorded as destination on any other row, similarly, the sequence of steps end with a destination which is not at the same time also an origin).

I can imagine two desirable outcomes I would like to obtain:

a list, where each element of list stores vector of the related steps.
a data table where each row stores the related steps and number of columns in the data table corresponds to the length of the longest sequence of steps.

The data table in this case would look like:

   sequence_id step_1 step_2 step_3 step_4
1:           1    129   2689   2945   3201
2:           2    161    673     NA     NA
3:           3   4897   3105   1057     NA

Now I would like a way that dynamically identifies how many columns the resulting table should have, but in practice I know that there will be no more than 12 consecutive steps.

EDIT:

the original question has been already answered, however my real scenario turns out to be a bit more complicated than i originally anticipated.

The process described above can actually move from one origin to two different destinations.

An example of data:

structure(list(step_origin = c(3105, 2689, 2689, 1610), step_destination = c(2689, 
2945, 3201, 6730), time = c("2019-03-27 13:24:07", "2019-03-27 20:46:58", 
"2019-03-28 16:02:57", "2019-03-28 16:12:44")), row.names = c(NA, 
-4L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x000001a52ad81ef0>)

Which looks like:

   step_origin step_destination                time
1:        3105             2689 2019-03-27 13:24:07
2:        2689             2945 2019-03-27 20:46:58
3:        2689             3201 2019-03-28 16:02:57
4:        1610             6730 2019-03-28 16:12:44

Which basically means that from 2689 the process splits to 2945 and 3201. Note that one destination is always reached from exactly one origin, but one origin can have multiple destinations.

I can get to:

   sequence_id step_1 step_2 step_3
1:           1   3105   2689   2945
2:           2   2689   3201     NA
3:           3   1610   6730     NA

Using the approaches already proposed, however, in this case, I would like to have

   sequence_id step_1 step_2 step_3
1:           1   3105   2689   2945
2:           2   3105   2689   3201
3:           3   1610   6730     NA

Which would indicate that destinations 2945 and 3201 were reached from start at 3105.

chinsoon12 · Accepted Answer · 2019-05-10T13:51:29.053

Another possibility using igraph to construct clusters and then data.table::dcast to get desired wide data table:

library(igraph)
g <- graph_from_data_frame(DF)
seqid <- clusters(g)$membership
dcast(as.data.table(seqid, keep.rownames=TRUE),
    seqid ~ rowid(seqid), 
    value.var="rn")

output:

   seqid    1    2    3    4
1:     1 4897 3105 1057 <NA>
2:     2  129 2689 2945 3201
3:     3  161  673 <NA> <NA>

edit: to address qn edit and comment, still using igraph but finding all possible paths instead of clustering now.

library(igraph)
library(data.table)
DF2 <- structure(list(step_origin = c(3105, 2689, 2689, 1610), step_destination = c(2689,
    2945, 3201, 6730), time = c("2019-03-27 13:24:07", "2019-03-27 20:46:58",
        "2019-03-28 16:02:57", "2019-03-28 16:12:44")), row.names = c(NA,
            -4L), class = c("data.table", "data.frame"))
DF2 <- rbindlist(list(DF2, DF2), idcol="ID")

gDT <- DF2[, .(graph=.(graph_from_data_frame(.SD))), by=.(ID), 
    .SDcols=c("step_origin", "step_destination")]

#create all combinations of roots and leaf nodes
rootleaf <- DF2[, CJ(setdiff(step_origin, step_destination),
        setdiff(step_destination, step_origin)), 
    by=.(ID)][, 
        c("V1", "V2") := lapply(.SD, as.character), .SDcols=c("V1", "V2")]

#get all paths from roots to leaf nodes
#see https://stackoverflow.com/a/25758769/1989480
paths <- rootleaf[, {
        id <- .BY$ID
        g <- gDT[ID==id, graph][[1L]]

        .(.SD[, .(lapply(all_shortest_paths(g, from=V1, to=V2)$res,
                function(sp) transpose(as.data.table(c(id, V(g)[sp]$name))))),
            by=seq_len(.SD[,.N])]$V1)
    },
    by=.(ID)]

#get desired wide output
rbindlist(paths$V1, use.names=TRUE, fill=TRUE)

output:

   V1   V2   V3   V4
1:  1 1610 6730 <NA>
2:  1 3105 2689 2945
3:  1 3105 2689 3201
4:  2 1610 6730 <NA>
5:  2 3105 2689 2945
6:  2 3105 2689 3201

Hi there. This looks good. I have also added a one more request to the question. Any idea how to solve it? — ira, May 09 '19 at 06:58
Thank you! one last question, hopefully: do you know how to apply the search for separate groups in the data.table? i.e. if I had in the original data.table an extra column, which would indicate some id. And some of the steps would be relevant only for that id? — ira, May 10 '19 at 09:53
shouldn't be too hard. should be just adding some `by=.(id)` and do a `c(id, V(g)[sp]$name)` — chinsoon12, May 10 '19 at 10:08
I suppose I would have to change the graph_from_data_frame in some way? — ira, May 10 '19 at 11:08
@ira, i created a dummy sample dataset. it should work now. code can prob be more concise though. — chinsoon12, May 10 '19 at 13:52

score 1 · Answer 2 · answered May 08 '19 at 15:24

1

A possible solution:

DT[, .(step = c(step_origin, step_destination[.N]))
   , by = .(sequence_id = DT[, cumsum(c(TRUE, step_origin[-1] != step_destination[-.N]))])
   ][, dcast(.SD, sequence_id ~ rowid(sequence_id, prefix = "step_"), value.var = "step")
     ][order(step_1)][, sequence_id := .I][]

which gives:

   sequence_id step_1 step_2 step_3 step_4
1:           1    129   2689   2945   3201
2:           2    161    673     NA     NA
3:           3   4897   3105   1057     NA

answered May 08 '19 at 15:24

Jaap

81,064
34
182
193

This looks awesome and does exactly what I need (at least for my simplified example here), could you please try to explain the logic behind? – ira May 08 '19 at 15:27
Would you also know how to handle the case when a process would split at one of the origins? i.e. I would also have in the original table a row such as `step_origin = 3105` and `step_destination = 2090`. In such a case I would like to see in the resulting table a new row which would be `step_1 = 4897`, `step_2 = 3105` and `step_3 = 2090` – ira May 08 '19 at 20:20
I have edited the question to include the request added in the above comment. Do you know how to best tackle this part? – ira May 09 '19 at 06:57
1

@ira I think the igraph approach of chinsoon is the most logical one in your case – Jaap May 09 '19 at 21:42

How to match linked steps indicated by two columns of a data.table

2 Answers2