2

I have a table that show connections between events:

library(data.table)
df = data.table(p1 = c("x0", "x0", "x1", "x2", "x3"),
                p2 = c("x1", "x2", "x3", "x3", "x4"))

Here is an illustration:

enter image description here

The next event may happen only if all previous events have already happened. For example, event x3 may happen only after x1 and x2 irrespective of their sequence.

How can I convert the df table into following one (where all events appear in some permissible order), in a data.table way:

df_required = data.table(p = c("x0", "x1", "x2", "x3", "x4", 
                               "x0", "x1", "x2", "x3", "x4"),
                         sequence = c(1, 2, 3, 4, 5, 1, 3, 2, 4, 5),
                         group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))

The required table shows two possible groups of connections: x0-x1-x2-x3-x4 and x0-x2-x1-x3-x4. There are two possible ways because two values may immediately follow x0: x1 or x2. The sequence is written above the circles in the illustration as well.

Frank
  • 66,179
  • 8
  • 96
  • 180
Serhii
  • 362
  • 4
  • 15
  • in your source `df`, "`x3`" occurs three times... is this an error? – Wimpel Oct 12 '18 at 10:14
  • No, this is not an error. I'll add a picture that illustrate connections in couple of minutes. – Serhii Oct 12 '18 at 10:17
  • 1
    You definitely should use igraph. Do the paths start and end always at the same respective vertices? Then you could do `library(igraph); lapply(all_shortest_paths(graph_from_data_frame(df), "x0", "x4")$res, as.vector)` and continue from there. – Roland Oct 12 '18 at 10:30
  • Ronald, thanks! It gives sequence, all the rest I can do. But data.table way would be better. – Serhii Oct 12 '18 at 10:33
  • data.table is not a tool for dealing with graphs. – Roland Oct 12 '18 at 10:34
  • I know, but graph is formed from the table. – Serhii Oct 12 '18 at 10:36
  • 1
    I doubt you can find a data.table approach that is more efficient than igraph. But of course, you can combine data.table and igraph. It might be possible to improve my suggested approach. I'm not an expert with igraph since I usually don't deal with graphs. – Roland Oct 12 '18 at 10:41

2 Answers2

2

I just post this since it gives the same output as Rolands suggestion:

(i'll remove it if its non-sense)

data:

library(data.table)
df = data.table(p1 = c("x0", "x0", "x1", "x2", "x3"),
                p2 = c("x1", "x2", "x3", "x3", "x4"))

code:

restElements <- setdiff(df$p1, df$p2)
ans <-
    t(do.call(
        expand.grid, c(restElements, unique(split(df$p2,df$p1)))
        ))

group = rep(1:ncol(ans), each = nrow(ans))

p     = c( ans )

sequence = as.numeric(factor(p))

data.table(p, sequence, group)

result:

#    p sequence group
#1: x0        1     1
#2: x1        2     1
#3: x3        4     1
#4: x4        5     1
#5: x0        1     2
#6: x2        3     2
#7: x3        4     2
#8: x4        5     2

please note:

  • make sure when you set the factor: factor(p), you get the correct order. (by default factor levels are just sorted. Works with this example, might not work with others.)

  • Instead of my ans it's probably wiser to use the igraph method.


So you can combine both:

borrowed from @Roland

lvls <- levels(factor(c(df$p1, df$p2)))
library(igraph);
tmp <- lapply(all_shortest_paths(graph_from_data_frame(df), lvls[1], lvls[length(lvls)])$res, as.vector)
ans <- sapply(tmp, function(x) { lvls[x] })

You can use this ans. Make sure you later use: sequence = as.numeric(factor(p, lvls))

Andre Elrico
  • 10,956
  • 6
  • 50
  • 69
  • Andre, In group 1 x2 is missing, while in group 2 x1 is missing. Perhaps I can modify your solution to get the required one. Thanks for the idea! – Serhii Oct 12 '18 at 10:42
  • I dont really understand. When we take your picture. If you go the "top-way" there is no `x2`. If we go the "lower-way" there is no `x1`. – Andre Elrico Oct 12 '18 at 10:47
  • You are right, some extra explanation should be added. The chart show connections. x0..xn may be treated as events. The next event may happen only if all previous events have already happened. In this way, x3 may happen only after x1 and x2. – Serhii Oct 12 '18 at 11:12
  • Well in that case you just need to "insert" the missing factor levels on the meaning full spots. – Andre Elrico Oct 12 '18 at 11:15
  • How can in that logic `x1` "happen" after `x2` ? – Andre Elrico Oct 12 '18 at 11:16
  • Let me give you an example. Suppose we have two types of fruits: apples and peaches. x1 means apples are eaten, x2 means peaches are eaten. x3 means fruits are eaten. Some may start from any fruit he/she likes. But fruits are eaten (x3) only when apples (x1) and peaches are eaten (x2). – Serhii Oct 12 '18 at 11:22
2

You could assign a rank to each node (assuming you have a graph for which this makes any sense)...

vdf = data.table(p = sort(unique(unlist(df[, c("p1", "p2")]))))

i = 0L
vdf[, r := 0L]
while (any(vdf[r == i, p] %in% df$p1)){
  vdf[r == i, r := r + !df[.(p), on=.(p1), p %in% setdiff(p1, p2)]]
  i = i + 1L
}

    p r
1: x0 0
2: x1 1
3: x2 1
4: x3 2
5: x4 3

If there's a unique first event, x0, then thanks to @Roland, here's a simpler way:

library(igraph)
vdf[, r := as.vector(distances(graph_from_data_frame(df), "x0"))]

Then, for each rank having more than one node, take all permutations (here, borrowing from Generating all distinct permutations of a list in R)...

wdf = vdf[, do.call(cbind, lapply(split(.I, r), function(x) as.data.table(
  gtools::permutations(length(x), length(x), x)
)))]

   0.V1 1.V1 1.V2 2.V1 3.V1
1:    1    2    3    4    5
2:    1    3    2    4    5

The values in wdf are row numbers (see ?.I) of vdf, so...

mdf = melt(wdf[, g := .I], id = "g", value.name = "w")[order(g, variable)]
vdf[mdf$w, .(p, g = mdf$g, r)][, seq := rowid(g)][]

     p g r seq
 1: x0 1 0   1
 2: x1 1 1   2
 3: x2 1 1   3
 4: x3 1 2   4
 5: x4 1 3   5
 6: x0 2 0   1
 7: x2 2 1   2
 8: x1 2 1   3
 9: x3 2 2   4
10: x4 2 3   5

So g is the "group" mentioned in the OP; r is the rank; seq is the sequence within the group (useful so that the sorting of the table is explicit).


Comment. I would stop after assigning the rank/depth attribute to each node in vdf. All the information about feasible sequences of events is here, but enumerating them (as in the OP's output) can be very costly, in terms of computational time and space, and so should be avoided if possible.

The number of permutations for events x sharing the same rank is is factorial(length(x)), so for example if x has a length of 10, the matrix returned has dimensions dim(gtools::permutations(10, 10)) = 3628800 x 10. My computer hangs when trying to compute it.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • The rank/depth seems like something that should be in igraph, but I couldn't find it. It makes sense that the permutations thing is not in igraph since the OP's desired output is an inefficient way to represent the information already contained in the rank/depth. – Frank Oct 12 '18 at 16:08
  • 1
    `distances(graph_from_data_frame(df), "x0")` – Roland Oct 12 '18 at 20:05
  • Frank, thank for the comprehensive answer! You mentioned that the desired output is not the best one. From your perspective, what would be the efficient way to represent output? – Serhii Oct 13 '18 at 08:07
  • @Sergiy Cool, I've added an explanation of what I meant at the end of the question now. – Frank Oct 13 '18 at 15:01