2

The themes dataset from Rebrickable includes for each theme its ID and its parent's ID (columns have been renamed here), which may recurse (ID may have grandparents, great-grandparents, etc.) Here is an example which follows the parent chain City -> Advent -> Seasonal:

themes <- data.frame(theme_id = c(206, 207, 208), 
                     theme_name = c("Seasonal", "Advent", "City"), 
                     parent_theme_id = c(NA, 206, 207))

Here is my base R code to follow the parent IDs up until reaching NA, meaning no parent:

for (i in 1:nrow(themes)) {
  orig_id <- themes[i,]$theme_id
  cur_id <- orig_id
  repeat {
    par_id <- themes[themes$theme_id == cur_id,]$parent_theme_id
    if (is.na(par_id)) break
    cur_id <- par_id
  }
 
  themes[themes$theme_id == orig_id,]$ancestor_theme_id = cur_id
}

Is there a nicer tidyverse or base R way of creating a new column that consists of the greatest ancestor IDs? I can't think of a nice way to recursively follow ID "pointers" without operating one row at a time and having a loop somewhere.

zephryl
  • 14,633
  • 3
  • 11
  • 30
qwr
  • 9,525
  • 5
  • 58
  • 102
  • 1
    Sorry but I do not exactly understand what you want. Maybe you could explain in detail? – TarJae Dec 30 '22 at 21:05
  • I don't know of an easy tidyverse way to do things, possibly because open recursion like that (in the hands of unaware users) can run amuck rather easily (and, since R doesn't do tail-recursion efficiently, it uses stack-space rather inefficiently). – r2evans Dec 30 '22 at 21:06
  • @r2evans there are no loops or missing IDs, if it helps – qwr Dec 30 '22 at 21:07
  • I'm not saying that recursion in this example is problematic, I'm saying that I do not know of any standard functions in dplyr/tidyr/purrr that deal with recursive processing, likely because it's easy to get into a bad mess using recursion incorrectly. (Loops are a different story of recursion pain.) I suspect that you can do something a little more frame-friendly by repeatedly `left_join`ing it on itself until all RHS-`theme_id` (joined in) are `NA`. – r2evans Dec 30 '22 at 21:21
  • 1
    @r2evans I actually originally tried left_join with itself before I realized parents could recurse. If ancestors only goes up to grandparents then I would only need two joins. – qwr Dec 30 '22 at 21:29

3 Answers3

3

Three methods come to mind:

  1. Naive data.table method.
  2. Memoized data.table method.
  3. Using the environment object.

1. Naive data.table method

I would use data.table here because once you set a key, individual row lookups tend to be twice as fast as tidyverse.

setDT(themes)
setkey(themes, theme_id)

lookup_parent <- function(themes, id_to_lookup) {
    id_parent <- themes[theme_id == id_to_lookup, parent_theme_id]

    if (is.na(id_parent)) {
        return(id_to_lookup)
    }

    lookup_parent(themes, id_parent)
}

themes[,
    ancestor_theme_id := lookup_parent(themes, theme_id),
    by = .I
]

# Index: <theme_id>
#    theme_id theme_name parent_theme_id ancestor_theme_id
#       <num>     <char>           <num>             <num>
# 1:      206   Seasonal              NA               206
# 2:      207     Advent             206               206
# 3:      208       City             207               206

This is still basically going to loop over every row but it reads more nicely.

2. Memoized data.table

The above will be slow with large data or deeply nested children. A cheap and easy way to make it quicker is to use the memoise package to cache previous lookups, so we don't have to recurse all the way to the parent each time.

library(memoise)
lookup_parent_m <- memoise(lookup_parent)

themes[,
    ancestor_theme_id := lookup_parent_m(themes, theme_id),
    by = .I
]

Although data.table is faster than tidyverse, both are relatively slow for these kinds of operations compared to hashing or the built-in environment object. From the benchmarks in this answer you can see several alternatives are considerably faster than data.table:

Unit: microseconds
                  expr      min        lq       mean    median        uq      max neval
      lookup_hash[[x]]   10.767   12.9070   22.67245   23.2915   26.1710   68.654   100
 test_lookup_list[[x]]  847.700  853.2545  887.55680  863.0060  893.8925 1369.395   100
     test_lookup_dt[x] 2652.023 2711.9405 2771.06400 2758.8310 2803.9945 3373.273   100
  test_lookup_env[[x]]    1.588    1.9450    4.61595    2.5255    6.6430   27.977   100

3. Using the environment object

This is relatively straightforward to do. However, there is a fixed cost of creating the environment, which takes no time at all with the example data but may be noticeable with your real data.

lookup_list <- list()

lookup_list[as.character(
    themes$theme_id
)] <- themes$parent_theme_id

lookup_env <- list2env(lookup_list)

lookup_parent_in_env <- function(id, env = lookup_env) {
    id_parent <- env[[as.character(id)]]

    if (is.na(id_parent)) {
        return(id)
    }

    lookup_parent_in_env(id_parent, env)
}

# Let's memoise for good measure
lookup_parent_in_env_m  <- memoise(lookup_parent_in_env)

themes$parent_theme_id  <- sapply(
    themes$theme_id, 
    lookup_parent_in_env_m
)

themes

#   theme_id theme_name parent_theme_id
# 1      206   Seasonal             206
# 2      207     Advent             206
# 3      208       City             206

This is not a common pattern so it does sacrifice some readability, but I imagine it will be significantly faster than the data.table option.

SamR
  • 8,826
  • 3
  • 11
  • 33
  • I've never used data.table but this looks easier to do rowwise operations – qwr Dec 30 '22 at 21:42
  • 2
    Nice use of memoise! – zephryl Dec 30 '22 at 21:54
  • @thanks! Looking at your answer I wonder if my second logical test is necessary - I thought that a theme could have its own ID as a parent, but actually that makes no sense, so your function is a little more concise. – SamR Dec 30 '22 at 21:59
  • yes, a theme will never have itself as a parent – qwr Dec 30 '22 at 22:19
  • by the way the dataframe is small enough so that performance is not a concern – qwr Dec 30 '22 at 22:27
  • 1
    @qwr OK I've taken out the second logical condition. Re performance, then I would just use whichever of these is most readable. Row-wise lookups scale terribly in R though so I enjoy thinking about performance with these kinds of questions even if it's not an issue here. – SamR Dec 30 '22 at 22:29
3

As @r2evans explains in the comments, there aren't tidyverse functions for the recursive lookup itself, but you can make your code a bit more succinct and tidyverse-y by defining a recursive function and calling it from mutate(map_dbl()):

library(dplyr)
library(purrr)

find_ancestor <- function(id) {
  dat <- cur_data()
  parent <- filter(dat, theme_id == id)$parent_theme_id
  if (is.na(parent)) id else find_ancestor(parent)
}

themes %>% 
  mutate(ancestor_theme_id = map_dbl(theme_id, find_ancestor))
  theme_id theme_name parent_theme_id ancestor_theme_id
1      206   Seasonal              NA               206
2      207     Advent             206               206
3      208       City             207               206
zephryl
  • 14,633
  • 3
  • 11
  • 30
0

this can get complex pretty fast when there are many connections/ multiple parents for a single node.
If so, you could give this igraph-approach a try

mydata <- data.frame(theme_id = c(206, 207, 208, 208, 207, 301, 302), 
                     parent_theme_id = c(NA, 206, 207, 209, 210, NA, 301))

library(tidyverse)
library(igraph)

verts <- unique(as.vector(as.matrix(mydata)))
edges <- mydata %>% select(parent_theme_id, theme_id) %>% drop_na()
g <- graph_from_data_frame(edges, directed = TRUE, vertices = verts[!is.na(verts)])
plot(g)

This results in the following graph.
The 'parents' from each node are the starting-points of the graph (follow the arrows from each node upstream to find them). enter image description here

# get startpoints (which are endpoints in your case)
startpoints <- names(V(g)[degree(g, mode = 'out') > 0 & degree(g, mode = 'in') == 0])
#[1] "206" "301" "209" "210"

#calculate all paths from all startpoints
l <- unlist(lapply(startpoints , function(x) all_simple_paths(g, from=x)), recursive = FALSE)
# get the nodes involved.. the first node is the 'parent'/starting point
ans <- lapply(seq.int(l), function(i) as_ids(l[[i]]))
# [[1]]
# [1] "206" "207"
# [[2]]
# [1] "206" "207" "208"
# [[3]]
# [1] "301" "302"
# [[4]]
# [1] "209" "208"
# [[5]]
# [1] "210" "207"
# [[6]]
# [1] "210" "207" "208"

unique(
  bind_rows(
    lapply(ans, function(x) data.frame(parent = x[1], child = x[2:length(x)]))
  ))

#   parent child
# 1    206   207
# 3    206   208
# 4    301   302
# 5    209   208
# 6    210   207
# 8    210   208

This information, you can join back into your original data

Wimpel
  • 26,031
  • 1
  • 20
  • 37