
I would like to run a Kedro pipeline with different inputs and save the results to an output folder, where the input and output paths are provided through the command line.

I saw the possibility of using kedro.config.TemplatedConfigLoader to pass new variables to a templated catalog, but that way I can only define the globals_dict variables manually in the hooks, as shown in the Kedro documentation.
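
For context, the documented pattern I am referring to looks roughly like this (a sketch assuming Kedro 0.17.x; the hook signature may differ in other versions), with the globals hard-coded, which is exactly what I would like to avoid:

# hooks.py - register a TemplatedConfigLoader with a manually defined
# globals_dict (values are fixed at startup, not taken from the command line)
from kedro.config import TemplatedConfigLoader
from kedro.framework.hooks import hook_impl

class ProjectHooks:
    @hook_impl
    def register_config_loader(self, conf_paths):
        return TemplatedConfigLoader(
            conf_paths,
            globals_pattern="*globals.yml",
            globals_dict={
                "input_path": "data/01_raw/input.csv",
                "output_path": "data/07_model_output/output.csv",
            },
        )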

Ideally, I would like to run something like this:

kedro run --pipeline="my_pipeline" --input="path_to_input_1" --output="path_to_output_1"
kedro run --pipeline="my_pipeline" --input="path_to_input_2" --output="path_to_output_2"

with a catalog like this:


input_df:
  type: pandas.CSVDataSet
  filepath: "${ input_path }"
  load_args:
    sep: "\t"
    index_col: 0
  save_args:
    index: True
    encoding: "utf-8"

output_df:
  type: pandas.CSVDataSet
  filepath: "${ output_path }"
  load_args:
    sep: "\t"
    index_col: 0
  save_args:
    index: True
    encoding: "utf-8"

and have the correct inputs analysed and the results stored in the correct output paths.

What would be the Kedro way to achieve this?


1 Answer


So this isn't available by default. You could try experimenting with a few parts of Kedro to make this work:

  • Edit cli.py to provide your new Click arguments (see the sketch after this list)
  • Experiment with the before_dataset_loaded hook to add your arguments to the globals.yml dictionary which TemplatedConfigLoader uses for string interpolation
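
As a rough sketch of the first point (assuming Kedro 0.17.x; KEDRO_INPUT_PATH and KEDRO_OUTPUT_PATH are made-up environment variable names used only to bridge the CLI and the config loader):

# cli.py - add --input/--output options to a project-level run command and
# stash them in environment variables so the register_config_loader hook
# can pick them up
import os
from pathlib import Path

import click
from kedro.framework.session import KedroSession

@click.group()
def cli():
    pass

@cli.command()
@click.option("--pipeline", "pipeline_name", default=None)
@click.option("--input", "input_path", default=None)
@click.option("--output", "output_path", default=None)
def run(pipeline_name, input_path, output_path):
    if input_path:
        os.environ["KEDRO_INPUT_PATH"] = input_path
    if output_path:
        os.environ["KEDRO_OUTPUT_PATH"] = output_path
    # "my_project" is a placeholder for your actual package name
    with KedroSession.create("my_project", Path.cwd()) as session:
        session.run(pipeline_name=pipeline_name)

Your register_config_loader hook can then build globals_dict from os.environ.get("KEDRO_INPUT_PATH") and os.environ.get("KEDRO_OUTPUT_PATH") instead of hard-coded values.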

That being said, run environments may be helpful for this:

https://kedro.readthedocs.io/en/latest/04_kedro_project_setup/02_configuration.html#additional-configuration-environments
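
For example (the environment names here are hypothetical), you could keep one configuration environment per dataset and select it at run time:

conf/base/catalog.yml     # shared catalog entries
conf/input_1/catalog.yml  # filepaths for the first dataset
conf/input_2/catalog.yml  # filepaths for the second dataset

kedro run --pipeline=my_pipeline --env=input_1
kedro run --pipeline=my_pipeline --env=input_2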

In general, we advise against dynamic pipelines, which are hard to read at rest. We encourage people to keep their catalogs explicit so they are still readable in six months' time.

  • Thanks for your answer! Do you mean that I could use before_dataset_loaded to intercept the command-line arguments I defined in cli.py and passed through the command line? Btw, I understand the problem with dynamic catalogues (in my case the pipeline would have a fixed structure) and reproducibility; it would therefore be nice to be able to save the rendered jinja2 catalog, so that anyone wanting to reproduce a run would just have to run the pipeline with the saved catalogue. – Isy89 Sep 10 '21 at 10:40
  • So you would have to define your own project-level CLI commands - or possibly override the existing `run` command. This is an interesting question - I'd suggest you ask this on our Discord server (https://discord.gg/akJDeVaxnB) and we can work this out together next week! – datajoely Sep 11 '21 at 22:01
  • Perfect, I posted the problem in Discord! thanks! – Isy89 Sep 13 '21 at 20:19
  • This is something I also desperately need. I have multiple datasets all going through the same pipeline, and configuring a pipeline per dataset is more than exhausting. Is there anything new out there to solve this problem? – kerfuffle Mar 24 '22 at 12:55
  • Hi @kerfuffle, from the Kedro perspective we don't necessarily view this as a problem. We encourage users to write explicit, reproducible pipelines - you're welcome to extend the framework if you want to go off-piste, but it is unlikely we would provide this as out-of-the-box functionality. – datajoely Mar 24 '22 at 13:30
  • @datajoely Thanks for your quick answer! I'm new to Kedro, so some things are still confusing to me. I mostly struggle with starting a pipeline from the CLI or from within a Python script I wrote myself. E.g. modular pipelines seem not to have any CLI options. Is there a recommended way to start pipelines? – kerfuffle Mar 24 '22 at 16:30
  • So @kerfuffle I think you're looking for this https://kedro.readthedocs.io/en/stable/09_development/03_commands_reference.html#modifying-a-kedro-run but in general, jump onto our Discord and we can better support you there :) – datajoely Mar 25 '22 at 15:40
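
For reference, the page linked in that last comment also covers the --params flag, which injects runtime parameters on the command line without editing any files (the key name below is made up, and note that --params feeds the parameters dictionary, not the catalog template globals):

kedro run --pipeline=my_pipeline --params input_path:data/01_raw/input.csv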