I'm using the {targets}
package in a data analysis project where I need to fetch datasets from a remote Oracle database.
My overarching task is twofold:
- get data from scattered over different views and tables in the Oracle DB.
- compute and wrangle a new data column based on data I pulled from Oracle.
- those computed columns, although originating from separate database tables, do share enough in common so I can merge them by some related index columns.
Each new column I compute is a "special snowflake", so I've build a dedicated wrangling function for each one, and a higher-order function to call each snowflake-wrangling sub function.
My problem is with iteration over data entities that had been created with targets::tar_target()
.
Reproducible Example
To accurately convey my problem, I unfortunately need to dedicate a decent amount of code in this example. The first part is only for generating the demo data and mimic the Oracle DB. You can just run the code and skip to the part that follows it.
1. Simulate data: Build a database with 4 tables
just run this code; not critical for understanding the problem
library(dplyr, warn.conflicts = FALSE)
library(babynames)
library(DBI)
library(RSQLite)
set.seed(2021)
simulate_df_from_colnames <- function(vec_of_colnames, desired_nrows, vec_of_ids) {
stopifnot(desired_nrows == length(vec_of_ids))
ncols <- length(vec_of_colnames)
n_values <- ncols * desired_nrows
vec <- runif(n = n_values, min = 1, max = 100)
vec[sample(1:length(vec), 0.2 * length(vec))] <- NA # sprinkle NA randomly in 20% of values
mat <- matrix(vec, ncol = ncols)
df <- as.data.frame(mat)
colnames(df) <- vec_of_colnames
df$id <- vec_of_ids
df <- df[,c(ncol(df),1:(ncol(df)-1))] # so the id column move from last to first position
return(df)
}
work_related <- c("acceptance", "accountability", "achievement", "adaptability", "adventure", "authenticity", "authority", "autonomy", "balance", "boldness", "bravery", "candor", "challenge", "clarity", "collaboration", "compassion", "communication", "community", "contribution", "creativity", "curiosity", "dependability", "determination", "diversity", "empathy", "enthusiasm", "equality", "family", "fairness", "flexibility", "friendship", "growth", "happiness", "hard_work", "honesty", "humility", "humor", "impact", "improvement", "ingenuity", "innovation", "kindness", "knowledge", "leadership", "learning", "loyalty", "meaningful_work", "optimism", "ownership", "participation", "patience", "peace", "persistence", "popularity", "power", "quality", "recognition", "relationships", "reliability", "reputation", "respect", "responsibility", "results", "security", "self_improvement", "simplicity", "spirituality", "stability", "success", "sustainability", "teamwork", "tenacity", "time_management", "transparency", "trustworthiness", "wealth", "wisdom", "work_ethic", "work_life_balance")
blood_tests <- c("white_blood_cell_count", "red_blood_cell_count", "hemoglobin", "hematocrit", "mean_corpuscular_volume", "platelet_count", "sodium", "potassium", "chloride", "carbon_dioxide", "blood_urea_nitrogen", "creatinine", "glucose", "calcium", "total_protein", "albumin", "bilirubin", "alkaline_phosphatase", "ast", "alt", "vitamin_b_12", "methylmalonic_acid", "ferritin")
physical <- c("systolic_blood_pressure", "diastolic_blood_pressure", "pulse_rate_beats_minute", "height", "weight", "bmi", "waist_circumference", "hip_circumference")
psych_traits <- c("accessible", "active", "adaptable", "admirable", "adventurous", "agreeable", "alert", "allocentric", "amiable", "anticipative", "appreciative", "articulate", "aspiring", "athletic", "attractive", "balanced", "benevolent", "brilliant", "calm", "capable", "captivating", "caring", "challenging", "charismatic", "charming", "cheerful", "clean", "clear_headed", "clever", "colorful", "companionly", "compassionate", "conciliatory", "confident", "conscientious", "abrasive", "abrupt", "agonizing", "aimless", "airy", "aloof", "amoral", "angry", "anxious", "apathetic", "arbitrary", "argumentative", "arrogantt", "artificial", "asocial", "assertive", "astigmatic", "barbaric", "bewildered", "bizarre", "bland", "blunt", "biosterous", "brittle", "brutal", "calculating", "callous", "cantakerous", "careless", "cautious", "charmless", "childish", "clumsy", "coarse", "cold")
my_names <-
babynames::babynames %>%
pull(name) %>%
unique() %>%
sample(1000)
df_work_related <- simulate_df_from_colnames(work_related, 1000, vec_of_ids = my_names)
df_blood_tests <- simulate_df_from_colnames(blood_tests , 1000, vec_of_ids = my_names)
df_physical <- simulate_df_from_colnames(physical , 1000, vec_of_ids = my_names)
df_psych_traits <- simulate_df_from_colnames(psych_traits, 1000, vec_of_ids = my_names)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, df_work_related, name = "DJLNGJN3445_NFKS")
copy_to(con, df_blood_tests , name = "DKFMDGNSQWRE_320586")
copy_to(con, df_physical , name = "KLDJNSDOIJFW_295868FJDI")
copy_to(con, df_psych_traits, name = "AQNF_223_daqVV")
Created on 2021-10-23 by the reprex package (v2.0.1)
2. My problem starts here; I have a database full of data I want to analyze.
After running the simulation code from above, we now have the con
object representing a remote DB. We can explore what tables are inside con
:
DBI::dbListObjects(con)
#> table is_prefix
#> 1 <Id> table = AQNF_223_daqVV FALSE
#> 2 <Id> table = DJLNGJN3445_NFKS FALSE
#> 3 <Id> table = DKFMDGNSQWRE_320586 FALSE
#> 4 <Id> table = KLDJNSDOIJFW_295868FJDI FALSE
#> 5 <Id> table = sqlite_stat1 FALSE
#> 6 <Id> table = sqlite_stat4 FALSE
Whomever granted access to the database also told us that it stores data about 1000 people, scattered over 4 different tables.
Table Name | Kind of Data Stored There |
---|---|
AQNF_223_daqVV | Psychological Measurements |
DJLNGJN3445_NFKS | Measurements related to employment |
DKFMDGNSQWRE_320586 | Blood Tests |
KLDJNSDOIJFW_295868FJDI | Physical Measurements such as height, weight, etc. |
3. Walkthrough with one new variable as example
Let's say that I want to compute a new variable that represents whether a person is fun to be with. From the table above I see that table name "AQNF_223_daqVV"
contains psychological measures so I understand it's a relevant table. Exploring that data I decide that my new variable "fun_to_be_with"
will be the average of existing variables accessible
, active
, and adaptable
.
library(dplyr)
compute_fun_to_be_with <- function(.dat) {
.dat %>%
select(id, accessible, active, adaptable) %>%
mutate(fun_to_be_with = rowMeans(across(c(accessible, active, adaptable))), .keep = "unused")
}
tbl(con, "AQNF_223_daqVV") %>%
collect() %>%
compute_fun_to_be_with()
#> # A tibble: 1,000 x 2
#> id fun_to_be_with
#> <chr> <dbl>
#> 1 Miari NA
#> 2 Demariana NA
#> 3 Halah NA
#> 4 Abdalah NA
#> 5 Infiniti NA
#> 6 Sydel 63.0
#> 7 Montelle 62.8
#> 8 Rhys NA
#> 9 Mijah 73.0
#> 10 Lamontre NA
#> # ... with 990 more rows
When I started exploring more interesting variables I could compute, I've come to realize that besides the initial computation (e.g., taking the mean), there are also several wrangling steps that are applicable for some variables but not for others. For example, sometimes I want to round the variable's values up or down, or take a logarithm, or otherwise. So each new variable I compute is a "special snowflake", and I have a wrapper function that orchestrates all those preferences.
compute_snowflake <- function(.dat, snowflake_name) {
switch(snowflake_name,
"fun_to_be_with" = compute_fun_to_be_with(.dat))
}
wrangle_snowflake <- function(snowflake_name,
raw_data_from_db,
replace_na_with_zero,
take_logarithm,
round = c("up", "down"),
standardize_as_zscore) {
raw_data_from_db %>%
compute_snowflake(snowflake_name) %>%
{if (replace_na_with_zero) mutate(., across({{ snowflake_name }}, tidyr::replace_na, 0)) else .} %>%
{if (take_logarithm) mutate(., across(fun_to_be_with, log)) else .} %>%
{if (round == "up") mutate(., across(fun_to_be_with, ceiling)) else .} %>%
{if (round == "down") mutate(., across(fun_to_be_with, floor)) else .} %>%
{if (standardize_as_zscore) mutate(., across(fun_to_be_with, scale)) else .}
}
3a) The need for iteration
For wrangling just one new variable, wrangle_snowflake ()
can be used as-is:
wrangle_snowflake(snowflake_name = "fun_to_be_with",
raw_data_from_db = tbl(con, "AQNF_223_daqVV") %>% collect(),
replace_na_with_zero = FALSE,
take_logarithm = TRUE,
round = "down",
standardize_as_zscore = FALSE)
But the issue is with the scale. My project requires wrangling about 100 new variables. And I don't want to repeat wrangle_snowflake()
this way 100 times. If we just save the DB table locally as an object in the environment, we can utilize purrr::pmap()
very nicely for iteration:
raw_tbl_psych <-
tbl(con, "AQNF_223_daqVV") %>%
collect()
tbl_parameters <-
tibble::tribble(~snowflake_name, ~raw_data_from_db, ~replace_na_with_zero, ~take_logarithm, ~round, ~standardize_as_zscore,
"fun_to_be_with", raw_tbl_psych, FALSE, TRUE, "down", FALSE)
tbl_parameters
#> # A tibble: 1 x 6
#> snowflake_name raw_data_from_db replace_na_with_zero take_logarithm round
#> <chr> <list> <lgl> <lgl> <chr>
#> 1 fun_to_be_with <tibble [1,000 x 71]> FALSE TRUE down
#> # ... with 1 more variable: standardize_as_zscore <lgl>
tbl_parameters %>%
purrr::pmap(.f = wrangle_snowflake)
#> [[1]]
#> # A tibble: 1,000 x 2
#> id fun_to_be_with
#> <chr> <dbl>
#> 1 Miari NA
#> 2 Demariana NA
#> 3 Halah NA
#> 4 Abdalah NA
#> 5 Infiniti NA
#> 6 Sydel 4
#> 7 Montelle 4
#> 8 Rhys NA
#> 9 Mijah 4
#> 10 Lamontre NA
#> # ... with 990 more rows
pmap()
is very powerful because I can extend tbl_parameters
and add many more snowflakes, but the call tbl_parameters %>% purrr::pmap(.f = wrangle_snowflake)
will remain the same.
3b) the need for {targets}
One major issue isn't reflected in this example: the data I need to fetch from the remote DB is huge. Each table (e.g., AQNF_223_daqVV
) could range 1-10 millions of rows. When that's the case, I don't want to load the entire data into the environment as an R object. Rather, the {targets}
package allows me to create a "target" for each giant table, which is stored as an .rds
file in the directory. This way I can use the giant table indirectly, without loading it.
finally: my problem
{targets}
doesn't work well with my pmap()
method. Since I don't want to bring the entire giant data tables to R's environment, I'd rather to simply refer to them by their name. This way, my tbl_parameters_2
would look something like:
tbl_parameters_2 <-
tibble::tribble(
~snowflake_name, ~db_name, ~replace_na_with_zero, ~take_logarithm, ~round, ~standardize_as_zscore,
"fun_to_be_with", "AQNF_223_daqVV", FALSE, TRUE, "down", FALSE,
"work_ethics", "DJLNGJN3445_NFKS", TRUE, TRUE, "up", FALSE,
"bmi", "KLDJNSDOIJFW_295868FJDI", FALSE, FALSE, "up", TRUE,
"risk_for_diabetes", "DKFMDGNSQWRE_320586", FALSE, FALSE, "down", FALSE
)
But! {targets}
doesn't allow to refer to an existing target by a string.
So if I use targets
to create one target per DB table:
library(targets)
tar_target(raw_tbl_psych, tbl(con, "AQNF_223_daqVV") %>% collect())
tar_target(raw_tbl_work, tbl(con, "DJLNGJN3445_NFKS") %>% collect())
tar_target(raw_tbl_physical, tbl(con, "KLDJNSDOIJFW_295868FJDI") %>% collect())
tar_target(raw_tbl_blood, tbl(con, "DKFMDGNSQWRE_320586") %>% collect())
And then want to have pmap()
to iterate over tbl_parameters_2
and for each string of db_name
replace it with the respective target, well it won't work.
swap_table_ugly_name_for_nice_target_name <- function(ugly_name) {
switch(ugly_name,
# ugly_name # targets name
"AQNF_223_daqVV" = "raw_tbl_psych",
"DJLNGJN3445_NFKS" = "raw_tbl_work",
"KLDJNSDOIJFW_295868FJDI" = "raw_tbl_physical",
"DKFMDGNSQWRE_320586" = "raw_tbl_blood"
)
}
tar_target(list_of_wrangled_snowflakes,
wrangle_snowflake(snowflake_name = tbl_parameters_2$snowflake_name,
db_name = swap_table_ugly_name_for_nice_target_name(tbl_parameters_2$db_name),
replace_na_with_zero = tbl_parameters_2$replace_na_with_zero,
take_logarithm = tbl_parameters_2$take_logarithm,
round = tbl_parameters_2$round,
standardize_as_zscore = tbl_parameters_2$standardize_as_zscore)
)
Well it just doesn't work. According to @landau this is because:
targets
detects dependency relationships using static code analysis
For those who have come thus far reading, maybe you have an idea how to combine iteration and referring to preexisting targets?