2

I'm using the {targets} package in a data analysis project where I need to fetch datasets from a remote Oracle database.

My overarching task is twofold:

  1. get data from scattered over different views and tables in the Oracle DB.
  2. compute and wrangle a new data column based on data I pulled from Oracle.
  • those computed columns, although originating from separate database tables, do share enough in common so I can merge them by some related index columns.

Each new column I compute is a "special snowflake", so I've build a dedicated wrangling function for each one, and a higher-order function to call each snowflake-wrangling sub function.

My problem is with iteration over data entities that had been created with targets::tar_target().


Reproducible Example

To accurately convey my problem, I unfortunately need to dedicate a decent amount of code in this example. The first part is only for generating the demo data and mimic the Oracle DB. You can just run the code and skip to the part that follows it.

1. Simulate data: Build a database with 4 tables
just run this code; not critical for understanding the problem

library(dplyr, warn.conflicts = FALSE)
library(babynames)
library(DBI)
library(RSQLite)

set.seed(2021)

simulate_df_from_colnames <- function(vec_of_colnames, desired_nrows, vec_of_ids) {
  
  stopifnot(desired_nrows == length(vec_of_ids))
  
  ncols        <- length(vec_of_colnames)
  n_values     <- ncols * desired_nrows
  vec          <- runif(n = n_values, min = 1, max = 100)
  vec[sample(1:length(vec), 0.2 * length(vec))] <- NA # sprinkle NA randomly in 20% of values
  mat          <- matrix(vec, ncol = ncols)
  df           <- as.data.frame(mat)
  colnames(df) <- vec_of_colnames
  df$id        <- vec_of_ids
  df           <- df[,c(ncol(df),1:(ncol(df)-1))] # so the id column move from last to first position

  return(df)
}

work_related <- c("acceptance", "accountability", "achievement", "adaptability", "adventure", "authenticity", "authority", "autonomy", "balance", "boldness", "bravery", "candor", "challenge", "clarity", "collaboration", "compassion", "communication", "community", "contribution", "creativity", "curiosity", "dependability", "determination", "diversity", "empathy", "enthusiasm", "equality", "family", "fairness", "flexibility", "friendship", "growth", "happiness", "hard_work", "honesty", "humility", "humor", "impact", "improvement", "ingenuity", "innovation", "kindness", "knowledge", "leadership", "learning", "loyalty", "meaningful_work", "optimism", "ownership", "participation", "patience", "peace", "persistence", "popularity", "power", "quality", "recognition", "relationships", "reliability", "reputation", "respect", "responsibility", "results", "security", "self_improvement", "simplicity", "spirituality", "stability", "success", "sustainability", "teamwork", "tenacity", "time_management", "transparency", "trustworthiness", "wealth", "wisdom", "work_ethic", "work_life_balance")
blood_tests  <- c("white_blood_cell_count", "red_blood_cell_count", "hemoglobin", "hematocrit", "mean_corpuscular_volume", "platelet_count", "sodium", "potassium", "chloride", "carbon_dioxide", "blood_urea_nitrogen", "creatinine", "glucose", "calcium", "total_protein", "albumin", "bilirubin", "alkaline_phosphatase", "ast", "alt", "vitamin_b_12", "methylmalonic_acid", "ferritin")
physical     <- c("systolic_blood_pressure", "diastolic_blood_pressure", "pulse_rate_beats_minute", "height", "weight", "bmi", "waist_circumference", "hip_circumference")
psych_traits <- c("accessible", "active", "adaptable", "admirable", "adventurous", "agreeable", "alert", "allocentric", "amiable", "anticipative", "appreciative", "articulate", "aspiring", "athletic", "attractive", "balanced", "benevolent", "brilliant", "calm", "capable", "captivating", "caring", "challenging", "charismatic", "charming", "cheerful", "clean", "clear_headed", "clever", "colorful", "companionly", "compassionate", "conciliatory", "confident", "conscientious", "abrasive", "abrupt", "agonizing", "aimless", "airy", "aloof", "amoral", "angry", "anxious", "apathetic", "arbitrary", "argumentative", "arrogantt", "artificial", "asocial", "assertive", "astigmatic", "barbaric", "bewildered", "bizarre", "bland", "blunt", "biosterous", "brittle", "brutal", "calculating", "callous", "cantakerous", "careless", "cautious", "charmless", "childish", "clumsy", "coarse", "cold")

my_names <- 
  babynames::babynames %>%
  pull(name) %>%
  unique() %>%
  sample(1000)

df_work_related <- simulate_df_from_colnames(work_related, 1000, vec_of_ids = my_names)
df_blood_tests  <- simulate_df_from_colnames(blood_tests , 1000, vec_of_ids = my_names)
df_physical     <- simulate_df_from_colnames(physical    , 1000, vec_of_ids = my_names)
df_psych_traits <- simulate_df_from_colnames(psych_traits, 1000, vec_of_ids = my_names)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

copy_to(con, df_work_related, name = "DJLNGJN3445_NFKS")
copy_to(con, df_blood_tests , name = "DKFMDGNSQWRE_320586")
copy_to(con, df_physical    , name = "KLDJNSDOIJFW_295868FJDI")
copy_to(con, df_psych_traits, name = "AQNF_223_daqVV")

Created on 2021-10-23 by the reprex package (v2.0.1)


2. My problem starts here; I have a database full of data I want to analyze.
After running the simulation code from above, we now have the con object representing a remote DB. We can explore what tables are inside con:

DBI::dbListObjects(con)
#>                                  table is_prefix
#> 1          <Id> table = AQNF_223_daqVV     FALSE
#> 2        <Id> table = DJLNGJN3445_NFKS     FALSE
#> 3     <Id> table = DKFMDGNSQWRE_320586     FALSE
#> 4 <Id> table = KLDJNSDOIJFW_295868FJDI     FALSE
#> 5            <Id> table = sqlite_stat1     FALSE
#> 6            <Id> table = sqlite_stat4     FALSE

Whomever granted access to the database also told us that it stores data about 1000 people, scattered over 4 different tables.

Table Name Kind of Data Stored There
AQNF_223_daqVV Psychological Measurements
DJLNGJN3445_NFKS Measurements related to employment
DKFMDGNSQWRE_320586 Blood Tests
KLDJNSDOIJFW_295868FJDI Physical Measurements such as height, weight, etc.

3. Walkthrough with one new variable as example
Let's say that I want to compute a new variable that represents whether a person is fun to be with. From the table above I see that table name "AQNF_223_daqVV" contains psychological measures so I understand it's a relevant table. Exploring that data I decide that my new variable "fun_to_be_with" will be the average of existing variables accessible, active, and adaptable.

library(dplyr)

compute_fun_to_be_with <- function(.dat) {
  .dat %>%
    select(id, accessible, active, adaptable) %>%
    mutate(fun_to_be_with = rowMeans(across(c(accessible, active, adaptable))), .keep = "unused")
}
tbl(con, "AQNF_223_daqVV") %>%
  collect() %>%
  compute_fun_to_be_with()
#> # A tibble: 1,000 x 2
#>    id        fun_to_be_with
#>    <chr>              <dbl>
#>  1 Miari               NA  
#>  2 Demariana           NA  
#>  3 Halah               NA  
#>  4 Abdalah             NA  
#>  5 Infiniti            NA  
#>  6 Sydel               63.0
#>  7 Montelle            62.8
#>  8 Rhys                NA  
#>  9 Mijah               73.0
#> 10 Lamontre            NA  
#> # ... with 990 more rows

When I started exploring more interesting variables I could compute, I've come to realize that besides the initial computation (e.g., taking the mean), there are also several wrangling steps that are applicable for some variables but not for others. For example, sometimes I want to round the variable's values up or down, or take a logarithm, or otherwise. So each new variable I compute is a "special snowflake", and I have a wrapper function that orchestrates all those preferences.

compute_snowflake <- function(.dat, snowflake_name) {
  
  switch(snowflake_name,
         "fun_to_be_with" = compute_fun_to_be_with(.dat))
}


wrangle_snowflake <- function(snowflake_name, 
                              raw_data_from_db, 
                              replace_na_with_zero, 
                              take_logarithm, 
                              round = c("up", "down"), 
                              standardize_as_zscore) {

  
  raw_data_from_db %>%
    compute_snowflake(snowflake_name) %>%
    {if (replace_na_with_zero) mutate(., across({{ snowflake_name }}, tidyr::replace_na, 0)) else .} %>%
    {if (take_logarithm) mutate(., across(fun_to_be_with, log)) else .} %>%
    {if (round == "up") mutate(., across(fun_to_be_with, ceiling)) else .} %>%
    {if (round == "down") mutate(., across(fun_to_be_with, floor)) else .} %>%
    {if (standardize_as_zscore)  mutate(., across(fun_to_be_with, scale)) else .}
} 

3a) The need for iteration
For wrangling just one new variable, wrangle_snowflake () can be used as-is:

wrangle_snowflake(snowflake_name = "fun_to_be_with", 
                  raw_data_from_db = tbl(con, "AQNF_223_daqVV") %>% collect(),
                  replace_na_with_zero = FALSE,
                  take_logarithm = TRUE,
                  round = "down",
                  standardize_as_zscore = FALSE)

But the issue is with the scale. My project requires wrangling about 100 new variables. And I don't want to repeat wrangle_snowflake() this way 100 times. If we just save the DB table locally as an object in the environment, we can utilize purrr::pmap() very nicely for iteration:

raw_tbl_psych <- 
  tbl(con, "AQNF_223_daqVV") %>% 
  collect()

tbl_parameters <- 
  tibble::tribble(~snowflake_name, ~raw_data_from_db, ~replace_na_with_zero, ~take_logarithm, ~round, ~standardize_as_zscore,
                "fun_to_be_with", raw_tbl_psych, FALSE, TRUE, "down", FALSE)
tbl_parameters
#> # A tibble: 1 x 6
#>   snowflake_name raw_data_from_db      replace_na_with_zero take_logarithm round
#>   <chr>          <list>                <lgl>                <lgl>          <chr>
#> 1 fun_to_be_with <tibble [1,000 x 71]> FALSE                TRUE           down 
#> # ... with 1 more variable: standardize_as_zscore <lgl>

tbl_parameters %>%
  purrr::pmap(.f = wrangle_snowflake)
#> [[1]]
#> # A tibble: 1,000 x 2
#>    id        fun_to_be_with
#>    <chr>              <dbl>
#>  1 Miari                 NA
#>  2 Demariana             NA
#>  3 Halah                 NA
#>  4 Abdalah               NA
#>  5 Infiniti              NA
#>  6 Sydel                  4
#>  7 Montelle               4
#>  8 Rhys                  NA
#>  9 Mijah                  4
#> 10 Lamontre              NA
#> # ... with 990 more rows

pmap() is very powerful because I can extend tbl_parameters and add many more snowflakes, but the call tbl_parameters %>% purrr::pmap(.f = wrangle_snowflake) will remain the same.

3b) the need for {targets}
One major issue isn't reflected in this example: the data I need to fetch from the remote DB is huge. Each table (e.g., AQNF_223_daqVV) could range 1-10 millions of rows. When that's the case, I don't want to load the entire data into the environment as an R object. Rather, the {targets} package allows me to create a "target" for each giant table, which is stored as an .rds file in the directory. This way I can use the giant table indirectly, without loading it.

finally: my problem

{targets} doesn't work well with my pmap() method. Since I don't want to bring the entire giant data tables to R's environment, I'd rather to simply refer to them by their name. This way, my tbl_parameters_2 would look something like:

tbl_parameters_2 <-
  tibble::tribble(
      ~snowflake_name,                  ~db_name, ~replace_na_with_zero, ~take_logarithm, ~round, ~standardize_as_zscore,
     "fun_to_be_with",          "AQNF_223_daqVV",                 FALSE,            TRUE, "down",                  FALSE,
        "work_ethics",        "DJLNGJN3445_NFKS",                  TRUE,            TRUE,   "up",                  FALSE,
                "bmi", "KLDJNSDOIJFW_295868FJDI",                 FALSE,           FALSE,   "up",                   TRUE,
  "risk_for_diabetes",     "DKFMDGNSQWRE_320586",                 FALSE,           FALSE, "down",                  FALSE
  )

But! {targets} doesn't allow to refer to an existing target by a string.

So if I use targets to create one target per DB table:

library(targets)

tar_target(raw_tbl_psych,  tbl(con, "AQNF_223_daqVV") %>% collect())
tar_target(raw_tbl_work,   tbl(con, "DJLNGJN3445_NFKS") %>% collect())
tar_target(raw_tbl_physical, tbl(con, "KLDJNSDOIJFW_295868FJDI") %>% collect())
tar_target(raw_tbl_blood,  tbl(con, "DKFMDGNSQWRE_320586") %>% collect())

And then want to have pmap() to iterate over tbl_parameters_2 and for each string of db_name replace it with the respective target, well it won't work.

swap_table_ugly_name_for_nice_target_name <- function(ugly_name) {
  
  switch(ugly_name,
         # ugly_name               # targets name
         "AQNF_223_daqVV"          = "raw_tbl_psych",
         "DJLNGJN3445_NFKS"        = "raw_tbl_work",
         "KLDJNSDOIJFW_295868FJDI" = "raw_tbl_physical",
         "DKFMDGNSQWRE_320586"     = "raw_tbl_blood"
         )
}
tar_target(list_of_wrangled_snowflakes,
           wrangle_snowflake(snowflake_name        = tbl_parameters_2$snowflake_name,
                             db_name               = swap_table_ugly_name_for_nice_target_name(tbl_parameters_2$db_name),
                             replace_na_with_zero  = tbl_parameters_2$replace_na_with_zero,
                             take_logarithm        = tbl_parameters_2$take_logarithm,
                             round                 = tbl_parameters_2$round,
                             standardize_as_zscore = tbl_parameters_2$standardize_as_zscore)
)

Well it just doesn't work. According to @landau this is because:

targets detects dependency relationships using static code analysis


For those who have come thus far reading, maybe you have an idea how to combine iteration and referring to preexisting targets?

Emman
  • 3,695
  • 2
  • 20
  • 44
  • 2
    this is the longest question I've ever posted. Its length is likely to hold people back from reading through and responding. If someone has a suggestion how to cut it down, I'll gladly do so. Thanks. – Emman Oct 23 '21 at 19:45
  • 1
    It's a lot to unpack, but maybe static branching could help: https://books.ropensci.org/targets/static.html. It requires approaching the problem a different way, but at a high level you could have a data frame of analysis settings and define a target for each row. Maybe have a look at `tar_map()` and `tar_eval()` from `tarchetypes`. Also, since you are using databases, maybe consider a more efficient `format` in `tar_target()`, e.g. `"feather"`. – landau Oct 23 '21 at 20:13
  • 1
    @landau, thanks. Can the data frame of analysis settings hold names of existing targets that will be passed as input when creating additional targets? – Emman Oct 24 '21 at 11:24
  • Both those existing and additional can be managed through branching, either static or dynamic (or both). There is usually no need to keep track of target names in a data frame. This is a different way of thinking about the workflow setup, but it will ultimately require less manual effort. – landau Oct 24 '21 at 13:08

0 Answers0