4

I am trying to create an edge list based on binary splits. If I have a data frame that only contains the node number and some other metric, then I can manually create an edge list for the nodes. For example, if my data frame looks like this:

dfTest <- data.frame(
  node = c(1,2,3,4,5),
  var = c("milk", NA, "coffee", NA, NA),
  isLeaf = c(F, T, F, T, T)
)

> dfTest
  node    var isLeaf
1    1   milk  FALSE
2    2   <NA>   TRUE
3    3 coffee  FALSE
4    4   <NA>   TRUE
5    5   <NA>   TRUE

Then, based on the var or isLeaf column, I can manually create an edge list to connect the nodes. For example, As node 2 is a leaf, I know that node 1 must go to node 2. Then (as they are binary splits) I know node 1 must also connect to node 3. And as node 4 and 5 are leaf nodes, I know that they must split on node 3.

Manually creating the edge list would look like this:

edges <- data.frame(
  from = c(1, 1, 3, 3),
  to   = c(2, 3, 4, 5)
)

The to column is easy to find... it will always be c(2:length(dfTest$nodes)). In this case 2,3,4,5. But the from column is proving difficult to find.

Just for a visual aid, the resulting tree would look like this:

example tree

Is there any way to do this without having to manually work out the edges?


EDIT: In response to an answer, I'm adding a slightly larger dataset to use:

dfTest <- data.frame(
  node = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  var = c("milk", "milk", NA, NA, "coffee", "sugar", NA, NA, "milk", NA, NA),
  isLeaf = c(F, F, T, T, F, F, T, T, F, T, T)
)

A little explanation: From the var column I know that milk (the root/node 1) splits to another milk (node 2). I can then see that node 2 splits to NA (node 3) and NA (node 4). As I know they are binary splits, I know that node 2 cant split any further. So, I must go back to the previous node that only had 1 split… in this case node 1 (i.e., milk) which then splits to the right on coffee (node 5). Again, as they are binary splits, I now know that coffee (node 5) must split to sugar (node 6). Sugar (node 6) is followed by 2 NAs (node 7 & 8 ). Now, I must go back to coffee (node 5) and split to the right to get milk (node 9) which splits to 2 NAs (node 10 &11)

The desired node/edge list should look like this:

edges <- data.frame(
  from = c(1,2,2,1,5,6,6,5,9,9),
  to   = c(2,3,4,5,6,7,8,9,10,11)
)

Which would produce a tree like this: bigger tree

Electrino
  • 2,636
  • 3
  • 18
  • 40
  • are the nodes in dfTest in topological order? – user20650 Aug 26 '21 at 14:28
  • to be honest, Im googling topological ordering of graphs now. I had not heard of this term before. So, im not sure – Electrino Aug 26 '21 at 14:31
  • I mean from your data if you swapped the rows with node 1 and 3 in your data you would get a different graph. So to get a specific graph we need to assume some ordering of the nodes in dfTest – user20650 Aug 26 '21 at 14:38
  • You are correct in saying that if we swapped node 1 and 3, we would get a different graph. So (from my limited and brief understanding) I would say, yes, the nodes in dfTest are in topological order – Electrino Aug 26 '21 at 14:41
  • You can't get the data in a better format? I'd be annoyed if I got data like this. That said, with just this data you could create groups based on a running count of isLeaf values (`rleid` is useful), but I'm not sure if that would scale to a larger dataset. How much bigger / deeper will the tree need to get? – camille Aug 26 '21 at 23:00
  • the data is all I have to work with. The tree sizes will vary,but not much deeper. Maximum depth is probably around 5 or 6. Average tree depth is 4. They should all be quite shallow trees – Electrino Aug 26 '21 at 23:10
  • I think it's better to provide the desired output of newly updated data as well as some explanations. – ThomasIsCoding Aug 30 '21 at 22:33
  • I guess what you want is using BFS traversal to construct a graph. However, I suspect it is not feasible if with BFS given only. You may need DFS as well. Maybe this would help you https://stackoverflow.com/questions/32766824/how-to-construct-a-tree-with-its-bfs-and-dfs-traversal – ThomasIsCoding Aug 31 '21 at 07:33

1 Answers1

3

As per your update, perhaps you can try the code below

grps <- split(dfTest, ~ cumsum(!isLeaf))

edges <- do.call(
  rbind,
  lapply(
    grps,
    function(x) {
      with(x, expand.grid(from = node[!isLeaf], to = node[isLeaf]))
    }
  )
)

for (k in seq_along(grps)) {
  if (nrow(grps[[k]]) == 1) {
    lleaf <- with(grps[[k + 1]], node[!isLeaf])
    rleaf <- with(grps[[k + 2]], node[!isLeaf])
    edges <- rbind(edges, data.frame(from = grps[[k]]$node, to = c(lleaf, rleaf)))
  }
}

edges <- `row.names<-`(edges[with(edges, order(from, to)), ], NULL)

which gives

> edges
   from to
1     1  2
2     1  5
3     2  3
4     2  4
5     5  6
6     5  9
7     6  7
8     6  8
9     9 10
10    9 11

Visualization

library(igraph)
graph_from_data_frame(edges) %>%
  plot(layout = layout_as_tree)

enter image description here

ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
  • King of `igraph` :D – Anoushiravan R Aug 30 '21 at 20:45
  • 1
    @AnoushiravanR Haha, thank you. I am still learning `igraph` :) – ThomasIsCoding Aug 30 '21 at 20:50
  • You are joking right?! I will start soon and will have tons of questions :D – Anoushiravan R Aug 30 '21 at 20:51
  • @AnoushiravanR No kidding. `igraph` has so many functions for many applications, but what I know is only a small piece of it for some common graph problems. – ThomasIsCoding Aug 30 '21 at 20:53
  • 1
    @AnoushiravanR And of course, your questions will be a good opportunity for me to learn more as well :) – ThomasIsCoding Aug 30 '21 at 20:56
  • This is a great answer and definitely something for me to start working with. However, it's not quite correct for what I'm trying to do. My tree only has binary splits. Your method can produce more than 2 splits on one node. I'm trying to adapt your example. – Electrino Aug 30 '21 at 22:02
  • @Electrino Maybe I didn't apply a proper `dfTest` for the binary split. You can try my code on your data and see if it works. – ThomasIsCoding Aug 30 '21 at 22:04
  • Unfortunately, it is still producing more than binary splits. I'll edit my question and add a slightly larger data frame as an example you can use to illustrate the non-binary splits... it looks like the issue is with the `with(dfTest, setNames(data.frame(embed(node[!isLeaf], 2)[, 2:1]), c("from", "to")))` line – Electrino Aug 30 '21 at 22:21
  • @Electrino Then what is the expected output of your new data? – ThomasIsCoding Aug 30 '21 at 22:30
  • I've re-edited my question to include the output – Electrino Aug 30 '21 at 22:40
  • @Electrino i updated my solution so you can check with it. – ThomasIsCoding Aug 30 '21 at 23:18
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/236577/discussion-between-electrino-and-thomasiscoding). – Electrino Aug 30 '21 at 23:32