Three methods come to mind:
- Naive
data.table
method.
- Memoized
data.table
method.
- Using the
environment
object.
1. Naive data.table
method
I would use data.table
here because once you set a key, individual row lookups tend to be twice as fast as tidyverse
.
setDT(themes)
setkey(themes, theme_id)
lookup_parent <- function(themes, id_to_lookup) {
id_parent <- themes[theme_id == id_to_lookup, parent_theme_id]
if (is.na(id_parent)) {
return(id_to_lookup)
}
lookup_parent(themes, id_parent)
}
themes[,
ancestor_theme_id := lookup_parent(themes, theme_id),
by = .I
]
# Index: <theme_id>
# theme_id theme_name parent_theme_id ancestor_theme_id
# <num> <char> <num> <num>
# 1: 206 Seasonal NA 206
# 2: 207 Advent 206 206
# 3: 208 City 207 206
This is still basically going to loop over every row but it reads more nicely.
2. Memoized data.table
The above will be slow with large data or deeply nested children. A cheap and easy way to make it quicker is to use the memoise
package to cache previous lookups, so we don't have to recurse all the way to the parent each time.
library(memoise)
lookup_parent_m <- memoise(lookup_parent)
themes[,
ancestor_theme_id := lookup_parent_m(themes, theme_id),
by = .I
]
Although data.table
is faster than tidyverse
, both are relatively slow for these kinds of operations compared to hashing or the built-in environment object. From the benchmarks in this answer you can see several alternatives are considerably faster than data.table
:
Unit: microseconds
expr min lq mean median uq max neval
lookup_hash[[x]] 10.767 12.9070 22.67245 23.2915 26.1710 68.654 100
test_lookup_list[[x]] 847.700 853.2545 887.55680 863.0060 893.8925 1369.395 100
test_lookup_dt[x] 2652.023 2711.9405 2771.06400 2758.8310 2803.9945 3373.273 100
test_lookup_env[[x]] 1.588 1.9450 4.61595 2.5255 6.6430 27.977 100
3. Using the environment
object
This is relatively straightforward to do. However, there is a fixed cost of creating the environment, which takes no time at all with the example data but may be noticeable with your real data.
lookup_list <- list()
lookup_list[as.character(
themes$theme_id
)] <- themes$parent_theme_id
lookup_env <- list2env(lookup_list)
lookup_parent_in_env <- function(id, env = lookup_env) {
id_parent <- env[[as.character(id)]]
if (is.na(id_parent)) {
return(id)
}
lookup_parent_in_env(id_parent, env)
}
# Let's memoise for good measure
lookup_parent_in_env_m <- memoise(lookup_parent_in_env)
themes$parent_theme_id <- sapply(
themes$theme_id,
lookup_parent_in_env_m
)
themes
# theme_id theme_name parent_theme_id
# 1 206 Seasonal 206
# 2 207 Advent 206
# 3 208 City 206
This is not a common pattern so it does sacrifice some readability, but I imagine it will be significantly faster than the data.table
option.