
I'd like to use drake to audit a series of validation and cleaning steps for a dataframe. I expect many functions forming a chain: a dataframe is passed in, a validation or a cleaning step runs, and the (possibly cleaned) dataframe is passed on to the next step. Is there a way to create this chain of function calls without explicitly naming each intermediate target in the plan?

A plan may look like this:

plan <- drake_plan(
    raw_data = load_data(),
    clean_data_1 = clean_step_1(raw_data, parms = "some parm"),
    clean_data_2 = clean_step_2(clean_data_1, parms = "some parm"),
    clean_data_3 = clean_step_3(clean_data_2, parms = "some parm"),
    ...
    clean_data_100 = clean_step_100(clean_data_99, parms = "some parm")
)

Is there a way to create this plan without having to come up with the intermediate names clean_data_<n> myself, letting drake generate them instead? Ideally I would keep the cleaning steps, in order, in a config file or similar, and have the plan assembled in that order without tracking the intermediate data names.

mpettis

2 Answers


I made a slight tweak to @landau's answer below. As written, it wasn't splicing in the different functions for me, so I pass them explicitly in the transform. I also added a params argument that is spliced in dynamically and is specific to each function.

# https://stackoverflow.com/q/58139703/1022967

library(drake)
library(rlang)
library(tibble)

functions <- syms(paste0("f", seq_len(4)))
index <- as.numeric(seq_len(4))
inputs <- syms(paste0("x_", index - 1))
#params = letters[1:4]
params <- c('{"a":1, "b":"z"}', '{"a":2, "b":"z"}', '{"a":3, "b":"z"}', '{"a":4, "b":"z"}')

grid <- tibble(
  functions = functions,
  index = index,
  inputs = inputs,
  params = params
)

plan <- drake_plan(
  x = target(
    f(inputs, param = p),
    transform = map(.data = !!grid, .id = index, f = !!functions, p = !!params)
  )
)

plan
#> # A tibble: 4 x 2
#>   target command                                  
#>   <chr>  <expr>                                   
#> 1 x_1    f1(x_0, param = "{\"a\":1, \"b\":\"z\"}")
#> 2 x_2    f2(x_1, param = "{\"a\":2, \"b\":\"z\"}")
#> 3 x_3    f3(x_2, param = "{\"a\":3, \"b\":\"z\"}")
#> 4 x_4    f4(x_3, param = "{\"a\":4, \"b\":\"z\"}")

# config <- drake_config(plan)
# vis_drake_graph(config)

Created on 2019-09-27 by the reprex package (v0.3.0)
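
For the config-driven case raised in the question (function names that follow no pattern, each with its own parameter), the same grid idea can be fed from a table of steps. This is only a minimal sketch: the `steps` table, its column names, and the three function names are hypothetical placeholders, while `load_data()` and the `parms` argument are borrowed from the question. The plan is only built here, not run, so the placeholder functions do not need to exist yet.

```r
library(drake)
library(rlang)
library(tibble)

# Hypothetical step table: one row per cleaning step, in the order to run.
# The function names and parameter strings are made up for illustration.
steps <- tibble(
  fn    = c("drop_empty_rows", "trim_whitespace", "cap_outliers"),
  parms = c("all", "both", "0.99")
)

n <- nrow(steps)
# as.numeric() mirrors the answers above (index is used for target names).
index <- as.numeric(seq_len(n))

grid <- tibble(
  f = syms(steps$fn),
  # Step 1 reads raw_data; step k reads the output of step k - 1.
  inputs = syms(c("raw_data", sprintf("clean_data_%d", seq_len(n - 1)))),
  p = steps$parms,
  index = index
)

plan <- drake_plan(
  raw_data = load_data(),
  clean_data = target(
    f(inputs, parms = p),
    transform = map(.data = !!grid, .id = index)
  )
)

plan
```

Going by the plans printed above, the generated targets should be named clean_data_1, clean_data_2, and so on, each taking the previous target as its first argument. In practice `steps` could be read from a CSV instead of being typed inline, as long as the function-name column is converted with syms() before it goes into the grid.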

mpettis

I can think of a couple different ways using rlang::syms() and transformations in drake_plan(). First one:

library(drake)
library(rlang)

functions <- syms(paste0("f", seq_len(4)))
index <- as.numeric(seq_len(4))
inputs <- syms(paste0("x_", index - 1))

plan <- drake_plan(
  x = target(
    f(x, param = "some param"),
    transform = map(f = !!functions, x = !!inputs, id = !!index, .id = id)
  )
)

plan
#> # A tibble: 4 x 2
#>   target command                      
#>   <chr>  <expr>                       
#> 1 x_1    f1(x_0, param = "some param")
#> 2 x_2    f2(x_1, param = "some param")
#> 3 x_3    f3(x_2, param = "some param")
#> 4 x_4    f4(x_3, param = "some param")

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2019-09-27 by the reprex package (v0.3.0)

Second one:

library(drake)
library(rlang)
library(tibble)

f <- syms(paste0("f", seq_len(4)))
index <- as.numeric(seq_len(4))
inputs <- syms(paste0("x_", index - 1))

grid <- tibble(
  f = f,
  index = index,
  inputs = inputs
)

plan <- drake_plan(
  x = target(
    f(inputs, param = "some param"),
    transform = map(.data = !!grid, .id = index)
  )
)

plan
#> # A tibble: 4 x 2
#>   target command                      
#>   <chr>  <expr>                       
#> 1 x_1    f1(x_0, param = "some param")
#> 2 x_2    f2(x_1, param = "some param")
#> 3 x_3    f3(x_2, param = "some param")
#> 4 x_4    f4(x_3, param = "some param")

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2019-09-27 by the reprex package (v0.3.0)

landau
  • Thank you! I should add some clarification: the function names may not follow a nice pattern, but I can follow the idea of reading in a list of names and converting them to syms, so that is OK. The second part is that the param values will be unique to each function. If it helps, I can change the question to reflect that. So, ultimately, I'll likely have a CSV or dataframe with, say, 2 columns: the first being the function name (I can handle that part), and the second being the parameter values to feed that particular function (after the first argument, which is the dataframe). – mpettis Sep 27 '19 at 20:58
  • You can sub in the function args the same way as function names. Need a demo? – landau Sep 27 '19 at 21:00
  • I'll experiment and see if I can splice in the arguments I have in mind, following your pattern of creating the grid and adding a column of `param` arguments that go with the different functions. – mpettis Sep 27 '19 at 21:01
  • Yup, that should do it. – landau Sep 27 '19 at 21:01
  • I'll try myself, but either you or I will want to put the solution here for completeness, I think. That would be nice for posterity. It would be nice if it were added to your solution, however, since I'll accept it as the preferred solution then. Thank you for your help and awesome package. – mpettis Sep 27 '19 at 21:02
  • A slight correction on the second solution: to get the functions to resolve properly, I believe you want to use this transform line: `transform = map(.data = !!grid, .id = index, f = !!functions)` – mpettis Sep 27 '19 at 21:13
  • I made an answer that has the correction above, plus I added the dynamic param list addition. I'm accepting your answer as the preferred one, and you can alter yours to add my stuff, or I'll just leave mine up for additional info. Thank you again. – mpettis Sep 27 '19 at 21:23