I have a data frame with two columns: an ID column and a column of sub IDs that belong to the corresponding ID. A sub ID can itself have sub IDs, in which case it also appears in the ID column.

library(tibble)

df <- tibble(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))

df

# A tibble: 6 x 2
     id sub_id
  <dbl>  <dbl>
1     1      2
2     1      3
3     2      4
4     2      5
5     3      6
6     7      8

I would like to write a function that finds all sub ID's that are related to an ID. It should return a vector with all sub ID's.

find_all_sub_ids <- function (data, id) {
data %>% ...
}

find_all_sub_ids(df, id = 1)

[1] 2 3 4 5 6

find_all_sub_ids(df, id = 2)

[1] 4 5

find_all_sub_ids(df, id = 9)

[1] NULL

This is very different from everything I have done in R so far and it was hard for me to phrase a good title for this question. So it is possible that with the right phrasing I could have already found an answer by just googling.

My first intuition for solving this was a while loop. Since I do not know how many sublevels there could be, the function should continue until all sub IDs are found. I have never used while loops, though, and don't really know how I could implement one here.

Maybe someone knows a good solution for this problem. Thanks!

Edit: Forgot to assign the tibble to df and to use this argument in the function call.

jpquast
  • Why `find_all_sub_ids(id = 1)` gives you so many numbers if there is only `2,3`? – Duck Jul 20 '20 at 17:53
  • with `dplyr` and your tibble called DF you could do `find_all_sub_ids <- function (x) dplyr::filter(DF, id==x) %>% .$sub_id` – user12728748 Jul 20 '20 at 17:57

3 Answers

With igraph:

library(igraph)
g <- graph_from_data_frame(df, directed = TRUE)

find_all_subs <- function(g,id){
  #find child nodes, first one being origin
  r <- igraph::subcomponent(g,match(id, V(g)$name),"out")$name
  #remove origin
  as.numeric(r[-1])
}
find_all_subs(g,1)
[1] 2 3 4 5 6

find_all_subs(g,2)
[1] 4 5
Waldi
  • After trying it out with my actual data I noticed it does not work properly if I use different ID's. ```df <- tibble(id = c(10, 10, 20, 20, 30, 40, 50, 60, 70), sub_id = c(200, 300, 400, 500, 600, 700, 800, 900, 1000))``` If you look for id = 10 you do not get 200 and 300. r would return 400 but is then NULL because r[-1]. – jpquast Jul 21 '20 at 07:42
  • see my edit, again a id vs name problem, this time with the id input to the function. igraph stays a good solution, but id/name logic has to be under control ;) – Waldi Jul 21 '20 at 08:41
  • I think you can use `subcomponent(g,toString(id),"out")$name` to fix the problem – ThomasIsCoding Jul 21 '20 at 08:41
  • @ThomasIsCoding, I tried your suggestion but get an 'invalid vertex name' error. See my edit : a bit trickier. This is however the [suggested way](https://stackoverflow.com/a/20220038/13513328) by the creator of the igraph package. – Waldi Jul 21 '20 at 08:45
  • I think OP wants the results `find_all_subs(g,10)` for `id = 10`, so you can use a character `"10"` as an input – ThomasIsCoding Jul 21 '20 at 08:49

I think it's easiest to formulate this as a graph problem.
Your data.frame describes a directed graph (edges going from id to sub_id), and you are interested in which nodes are reachable from a certain vertex.

Using tidygraph, this can be achieved as such:

library(tidyverse)
library(tidygraph)

df <- tibble(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))

find_all_sub_ids <- function (id) {
  if (!(id %in% df$id)) {
    return(NULL)
  }

  
  grph <- df %>% 
    as_tbl_graph(directed = TRUE)
  
  id <- which(grph %>% pull(name) == as.character(id))
  
  grph %>% 
    activate(nodes) %>% 
    mutate(reachable = !is.na(bfs_dist(id))) %>% 
    as_tibble() %>% 
    filter(reachable) %>% 
    pull(name) %>% 
    as.numeric()
}

To see which nodes are reachable we use bfs_dist (see here for explanation): reachable nodes have a non-NA distance to the given node. Note that the result includes the starting ID itself.
This gives

> find_all_sub_ids(1)
[1] 1 2 3 4 5 6

> find_all_sub_ids(2)
[1] 2 4 5

> find_all_sub_ids(9)
NULL

The advantage of such an approach is that it can search many levels deep without you needing to write a loop explicitly.
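For comparison, the explicit loop that the graph approach avoids could look roughly like the following. This is only a minimal base R sketch, not part of the tidygraph solution: the name `find_all_sub_ids_loop` is made up for illustration, and it assumes the columns are called `id` and `sub_id` as in the question. Unlike the function above, it leaves out the starting ID.

```r
# Hypothetical helper (not from the answer above): walks the id -> sub_id
# edges level by level with an explicit while loop, using base R only.
find_all_sub_ids_loop <- function(data, id) {
  found <- c()      # sub IDs collected so far
  frontier <- id    # IDs whose sub IDs still need to be looked up
  while (length(frontier) > 0) {
    # direct sub IDs of everything in the current frontier
    children <- unique(data$sub_id[data$id %in% frontier])
    # keep only IDs not seen before, so a cycle cannot loop forever
    frontier <- setdiff(children, c(found, id))
    found <- c(found, frontier)
  }
  if (length(found) == 0) NULL else found
}

df <- data.frame(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))
find_all_sub_ids_loop(df, 1)
find_all_sub_ids_loop(df, 9)
```

Each pass of the loop handles one level of the hierarchy, so the number of iterations equals the depth below the starting ID rather than the number of rows.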

Edit: There was a bug in my code; tidygraph::bfs_dist uses a different id than I expected. It is fixed now.
On the new example:

> find_all_sub_ids(10)
[1]  10 200 300
Bas
  • Thanks for your answer! This works well and does what it should. I learned some new things here thanks to your explanation. – jpquast Jul 20 '20 at 22:44
  • Also here I noticed that when I test it with my actual data it does not work properly. ``` df <- tibble(id = c(10, 10, 20, 20, 30, 40, 50, 60, 70), sub_id = c(200, 300, 400, 500, 600, 700, 800, 900, 1000)) ``` When I would take another example df and look for id = 10. it returns 400 instead of 10, 200 and 300. – jpquast Jul 21 '20 at 07:46
  • You are completely right - there was a bug in my code with the ID value (it referred to the row number instead of the value of the ID column). See edit for a working example. – Bas Jul 21 '20 at 08:32

I did it using a dataframe. The following works.

# Build the example data frame directly with named columns
df <- data.frame(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))


# Returns only the direct sub IDs of id_requested (one level deep)
find_all_sub_ids <- function (df, id_requested) {
  df[df$id == id_requested, ]$sub_id
}
find_all_sub_ids(df, id_requested = 2)
[1] 4 5
  • Thanks for your answer. But this is the same problem like the answer before. This function does not do what I ask for though. I would like to get all sub ID's of my ID. This one only gives the direct sub ID's and not their sub ID's. Sub ID of 1 is 2 (which has sub ID's 4 and 5) and 3 (which has sub ID 6) – jpquast Jul 20 '20 at 22:19
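
The multi-level lookup described in this comment could be built on a direct lookup by recursion. The following is only a sketch under stated assumptions: `direct_sub_ids` and `all_sub_ids_recursive` are hypothetical names, the columns are assumed to be `id` and `sub_id`, and the data is assumed to contain no cycles (a cycle would make the recursion run forever).

```r
# Hypothetical helper: direct (one level) sub IDs of a single ID.
direct_sub_ids <- function(df, id_requested) {
  df$sub_id[df$id == id_requested]
}

# Hypothetical helper: direct sub IDs plus, recursively, all of theirs.
# Assumes the id -> sub_id relation has no cycles.
all_sub_ids_recursive <- function(df, id_requested) {
  direct <- direct_sub_ids(df, id_requested)
  deeper <- unlist(lapply(direct, function(s) all_sub_ids_recursive(df, s)))
  unique(c(direct, deeper))
}

df <- data.frame(id = c(1, 1, 2, 2, 3, 7), sub_id = c(2, 3, 4, 5, 6, 8))
all_sub_ids_recursive(df, 1)
```

For an ID that does not occur in the data, this version returns an empty numeric vector rather than NULL.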