
Let's assume you have a pipeline whose steps can fail for some input elements, for example: FetchSomeImagesFromIds -> Resize -> DoSomethingElse

In this case, the first step manages to download only 10 out of 100 images and passes those to Resize.

I'm looking for suggestions on how to report or handle this missing data at the pipeline level, for example something like: Pipeline.errors() -> PluginX: Succeeded: 10, Failed: 90, Total: 100, Errors: key: error
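
A minimal sketch of what such a report could look like, written in plain Python rather than against Neuraxle's API. `StepReport`, `run_step`, and `fake_fetch` are hypothetical names introduced only for illustration, and the failing fetch is a stand-in for the real download:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Hashable, Tuple


@dataclass
class StepReport:
    """Hypothetical per-step summary; not part of Neuraxle."""
    step_name: str
    total: int = 0
    errors: Dict[Hashable, str] = field(default_factory=dict)

    @property
    def failed(self) -> int:
        return len(self.errors)

    @property
    def succeeded(self) -> int:
        return self.total - self.failed

    def __str__(self) -> str:
        return (f"{self.step_name}: Succeeded: {self.succeeded}, "
                f"Failed: {self.failed}, Total: {self.total}")


def run_step(name: str, fn: Callable[[Any], Any],
             data: Dict[Hashable, Any]) -> Tuple[Dict[Hashable, Any], StepReport]:
    """Apply fn to every value; keep the key -> data mapping for successes
    and record key -> error for failures instead of aborting."""
    report = StepReport(step_name=name, total=len(data))
    outputs: Dict[Hashable, Any] = {}
    for key, value in data.items():
        try:
            outputs[key] = fn(value)
        except Exception as exc:  # broad on purpose: the report is the error channel
            report.errors[key] = repr(exc)
    return outputs, report


def fake_fetch(image_id: int) -> str:
    """Stand-in download that only succeeds for the first 10 ids."""
    if image_id >= 10:
        raise IOError(f"could not download image {image_id}")
    return f"image-{image_id}"


ids = {i: i for i in range(100)}
images, fetch_report = run_step("FetchSomeImagesFromIds", fake_fetch, ids)
print(fetch_report)  # FetchSomeImagesFromIds: Succeeded: 10, Failed: 90, Total: 100
```

Because failed keys are simply absent from `images`, the next step (Resize) only sees data that exists, and the per-key errors stay available in `fetch_report.errors`.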

My current implementation removes the missing keys from current_keys so that the key -> data mapping is kept, and it actually exits the whole program if anything is missing, given the previous problem described in https://github.com/Neuraxio/Neuraxle/issues/418

Thoughts?

hexa

1 Answer


I think that using a service in your pipeline would be a good way to do this. Here's what I'd do, although other solutions could exist:

  1. Create your pipeline and pipeline steps.
  2. Create a context and add to it a custom memory bank service in which you keep track of which data was processed properly and which was not. Depending on your needs and broader context, it could be either a positive data bank or a negative one: you would respectively either add the processed examples to the set or subtract them from it.
  3. Adapt the pipeline made at point 1, and its steps, so that they use the service from the context in their handle_transform_data_container methods. If you want your pipeline to keep working until everything has been processed, you could even have a WhileTrue() step that loops forever, fetching the batches as they come, with no end condition other than a BreakIf() step and its lambda. The lambda would call the service to know how far along the data processing is.
  4. At the end of the whole processing, whether you broke out prematurely (without any while loop) or only at the very end, you still have access to the context and to what's stored inside.
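
The steps above can be sketched in plain Python, without Neuraxle's actual classes. Everything here is hypothetical: `ProcessedBankService`, the dict used as a context, `process_until_done`, and `flaky_fetch` are illustrative names only. The `while True` and the `break` condition play the roles of the WhileTrue() and BreakIf() steps, and the `max_rounds` cap is an added assumption so the sketch cannot loop forever:

```python
from typing import Any, Callable, Dict, Hashable, Iterable, Set


class ProcessedBankService:
    """Hypothetical 'positive' memory bank: remembers which keys succeeded."""

    def __init__(self, expected_keys: Iterable[Hashable]):
        self.expected: Set[Hashable] = set(expected_keys)
        self.done: Set[Hashable] = set()
        self.errors: Dict[Hashable, str] = {}

    def mark_done(self, key: Hashable) -> None:
        self.done.add(key)
        self.errors.pop(key, None)  # a later success clears an earlier failure

    def mark_failed(self, key: Hashable, error: Exception) -> None:
        self.errors[key] = repr(error)

    def remaining(self) -> Set[Hashable]:
        return self.expected - self.done

    def all_processed(self) -> bool:
        return not self.remaining()


def process_until_done(context: Dict[str, Any],
                       fetch: Callable[[Hashable], Any],
                       max_rounds: int = 5) -> Dict[Hashable, Any]:
    """Retry the remaining keys until the bank says everything is processed."""
    bank: ProcessedBankService = context["bank"]
    results: Dict[Hashable, Any] = {}
    rounds = 0
    while True:  # role of WhileTrue()
        # Role of BreakIf(lambda): ask the service how far along we are.
        if bank.all_processed() or rounds >= max_rounds:
            break
        rounds += 1
        for key in sorted(bank.remaining()):
            try:
                results[key] = fetch(key)
                bank.mark_done(key)
            except Exception as exc:
                bank.mark_failed(key, exc)
    return results


attempts: Dict[int, int] = {}


def flaky_fetch(key: int) -> str:
    """Stand-in download: odd keys fail on their first attempt only."""
    attempts[key] = attempts.get(key, 0) + 1
    if key % 2 == 1 and attempts[key] == 1:
        raise TimeoutError(f"timeout on {key}")
    return f"image-{key}"


context = {"bank": ProcessedBankService(range(6))}
results = process_until_done(context, flaky_fetch)
print(len(results), context["bank"].all_processed(), context["bank"].errors)
```

After the loop ends, the context still holds the bank, so whatever failed for good remains inspectable in `context["bank"].errors` and `context["bank"].remaining()`.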

More info:

  • To see an example on how to use the service and context together and using this in steps, see this answer: https://stackoverflow.com/a/64850395/2476920
  • Also note that the BreakIf and While steps are core steps that are not yet developed in Neuraxle. We recently had a brilliant idea with Vincent Antaki: Neuraxle is a language, and therefore the steps in a pipeline are like basic language keywords (While, Break, Continue, ForEach, and so forth). This abstraction is powerful in the sense that it makes it possible to control a data flow as a logical execution flow.

This is my best solution for now, and exactly this has never been done yet. There may be many other ways to do it, with some creativity. One could even think of adding TryCatch steps to the pipeline to catch some errors and manage what happens in the execution flow of the data.
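
The TryCatch idea could look roughly like this in plain Python. This is only a sketch under assumptions: `TryCatch` here is a hypothetical wrapper, not an existing Neuraxle step, and `resize` is a toy stand-in that works on (width, height) tuples:

```python
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TryCatch:
    """Hypothetical wrapper step: runs `step` per item and hands failures to `on_error`."""

    def __init__(self, step: Callable[[Any], Any],
                 on_error: Callable[[Hashable, Exception], Optional[Any]]):
        self.step = step
        self.on_error = on_error
        self.caught: Dict[Hashable, Exception] = {}

    def transform(self, data: Dict[Hashable, Any]) -> Dict[Hashable, Any]:
        out: Dict[Hashable, Any] = {}
        for key, value in data.items():
            try:
                out[key] = self.step(value)
            except Exception as exc:
                self.caught[key] = exc
                fallback = self.on_error(key, exc)
                if fallback is not None:  # None means "drop this item"
                    out[key] = fallback
        return out


def resize(size: Tuple[int, int]) -> Tuple[int, int]:
    """Toy resize: halves a (width, height) pair, rejecting empty images."""
    width, height = size
    if width <= 0 or height <= 0:
        raise ValueError(f"invalid size {size}")
    return (width // 2, height // 2)


# Drop failing items, but remember why they failed.
drop_step = TryCatch(resize, on_error=lambda key, exc: None)
dropped = drop_step.transform({"a": (100, 50), "b": (0, 10)})

# Alternatively, substitute a placeholder so downstream steps see every key.
fill_step = TryCatch(resize, on_error=lambda key, exc: (1, 1))
filled = fill_step.transform({"a": (100, 50), "b": (0, 10)})
print(dropped, filled)
```

The `on_error` callback is where the execution-flow decision lives: dropping the item, replacing it, or re-raising to stop the pipeline.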

All of this remains to be done in Neuraxle. We're looking for collaborators. It would make a nice paper as well: "Machine Learning Pipelines as a Programming Language" :)

Guillaume Chevalier