I am trying to create an edge list based on binary splits. If I have a data frame that only contains the node number and some other metric, then I can manually create an edge list for the nodes. For example, if my data frame looks like this:
dfTest <- data.frame(
  node   = c(1, 2, 3, 4, 5),
  var    = c("milk", NA, "coffee", NA, NA),
  isLeaf = c(FALSE, TRUE, FALSE, TRUE, TRUE)
)
> dfTest
  node    var isLeaf
1    1   milk  FALSE
2    2   <NA>   TRUE
3    3 coffee  FALSE
4    4   <NA>   TRUE
5    5   <NA>   TRUE
Then, based on the `var` or `isLeaf` column, I can manually create an edge list to connect the nodes. For example, as node 2 is a leaf, I know that node 1 must go to node 2. Then (as the splits are binary) I know node 1 must also connect to node 3. And as nodes 4 and 5 are leaf nodes, I know that they must be the two children of node 3.
Manually creating the edge list would look like this:
edges <- data.frame(
from = c(1, 1, 3, 3),
to = c(2, 3, 4, 5)
)
The `to` column is easy to find: it will always be `2:length(dfTest$node)`, in this case 2, 3, 4, 5. But the `from` column is proving difficult to find.
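As a quick sanity check on the part that is easy, here is a minimal sketch of the `to` column together with the one constraint I know `from` has to satisfy. The variable names `to` and `internal` are just mine for illustration, and the check assumes every non-leaf node has exactly two children:

```r
dfTest <- data.frame(
  node   = c(1, 2, 3, 4, 5),
  var    = c("milk", NA, "coffee", NA, NA),
  isLeaf = c(FALSE, TRUE, FALSE, TRUE, TRUE)
)

# Every node except the root is the target of exactly one edge.
to <- dfTest$node[-1]

# Assumption: binary splits, so each non-leaf node is the source of exactly
# two edges. `from` must therefore be some ordering of these nodes, each
# appearing twice.
internal <- dfTest$node[!dfTest$isLeaf]

# The edge count implied by `to` must match twice the number of internal nodes.
stopifnot(length(to) == 2 * length(internal))
```

This only tells me which values `from` contains, not the order they go in, which is the part I am stuck on.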
As a visual aid, the resulting tree has node 1 at the root with children 2 and 3, and node 3 in turn has children 4 and 5.
Is there any way to do this without having to manually work out the edges?
EDIT: In response to an answer, I'm adding a slightly larger dataset to use:
dfTest <- data.frame(
  node   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  var    = c("milk", "milk", NA, NA, "coffee", "sugar", NA, NA, "milk", NA, NA),
  isLeaf = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)
)
A little explanation:
From the `var` column I know that milk (the root, node 1) splits to another milk (node 2). I can then see that node 2 splits to NA (node 3) and NA (node 4). As the splits are binary, I know that node 2 can't split any further. So I must go back to the most recent node that has only one child so far… in this case node 1 (i.e., milk), which then splits to the right on coffee (node 5). Again, as the splits are binary, I now know that coffee (node 5) must split to sugar (node 6). Sugar (node 6) is followed by 2 NAs (nodes 7 and 8). Now I must go back to coffee (node 5) and split to the right to get milk (node 9), which splits to 2 NAs (nodes 10 and 11).
The desired node/edge list should look like this:
edges <- data.frame(
from = c(1,2,2,1,5,6,6,5,9,9),
to = c(2,3,4,5,6,7,8,9,10,11)
)
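The "go back to the most recent node that still needs a child" reasoning above can be sketched with a stack. This is only a rough sketch under two assumptions: the rows are listed in depth-first (pre-order) order, and every non-leaf node has exactly two children. `build_edges` is a hypothetical helper name of my own, not a function from any package:

```r
# Hypothetical helper: derive the edge list from a pre-order node table.
# Assumes rows are in depth-first order and every non-leaf has two children.
build_edges <- function(df) {
  n <- nrow(df)
  from <- numeric(n - 1)
  stack <- numeric(0)  # nodes still waiting for a child, most recent last

  for (i in seq_len(n)) {
    if (i > 1) {
      if (length(stack) == 0) {
        stop("node ", df$node[i], " has no open parent; is the input pre-order?")
      }
      top <- length(stack)
      from[i - 1] <- stack[top]  # parent is the most recent open node
      stack <- stack[-top]       # that parent now needs one child fewer
    }
    if (!df$isLeaf[i]) {
      # A non-leaf needs two children, so push it twice.
      stack <- c(stack, df$node[i], df$node[i])
    }
  }

  data.frame(from = from, to = df$node[-1])
}

dfTest <- data.frame(
  node   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  var    = c("milk", "milk", NA, NA, "coffee", "sugar", NA, NA, "milk", NA, NA),
  isLeaf = c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)
)

edges_out <- build_edges(dfTest)
# edges_out$from is 1 2 2 1 5 6 6 5 9 9, matching the desired edge list
```

Pushing each non-leaf node twice means the top of the stack is always the node I would "go back to" by hand, so a single pass over the rows is enough. If the rows were in some other order (for example breadth-first), this sketch would give wrong parents.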