
I am working with a legacy Kubeflow project. The pipelines have a few components that apply filters to a data frame.

To do this, each component downloads the data frame from S3, applies its filter, and uploads the result back to S3.

The components that use the data frame for training or validating the models also download it from S3.

My question is whether this is a best practice, or whether it is better to share the data frame directly between components, since the upload to S3 can fail and then the whole pipeline fails.
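Roughly, each filter component currently does something like this (a simplified sketch; the bucket/key names, the parquet format, and the filter condition are placeholders):

```python
import boto3
import pandas as pd

def filter_component(bucket: str, input_key: str, output_key: str) -> None:
    s3 = boto3.client("s3")

    # Download the intermediate data frame written by the previous component.
    s3.download_file(bucket, input_key, "/tmp/input.parquet")
    df = pd.read_parquet("/tmp/input.parquet")

    # Apply this component's filter (placeholder condition).
    df = df[df["value"] > 0]

    # Upload the result for the next component; if this upload fails,
    # the step fails and so does the pipeline run.
    df.to_parquet("/tmp/output.parquet")
    s3.upload_file("/tmp/output.parquet", bucket, output_key)
```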

Thanks

Tlaloc-ES

2 Answers


As always with questions asking for "best" or "recommended" method, the primary answer is: "it depends".

However, there are certain considerations worth spelling out in your case.

  1. Saving to S3 in between pipeline steps. This stores the intermediate results of the pipeline; as long as the steps take a long time and are restartable, it may be worth doing. What "a long time" means depends on your use case, though.

  2. Passing the data directly from component to component. This saves you storage throughput and, very likely, the not insignificant time needed to store and retrieve the data to / from S3. The downside: if you fail mid-way through the pipeline, you have to start from scratch.

So the questions are:

  • Are the steps idempotent (restartable)?
  • How often does the pipeline fail?
  • Is it easy to restart the processing from some mid-point?
  • Do you care about the processing time more than the risk of losing some work?
  • Do you care about the incurred cost of S3 storage/transfer?
sophros
  • But I can understand that you want to store the dataframe so that you don't have to start the pipeline from scratch; still, isn't the better option to pass it through the components and only store it at the end of each component? – Tlaloc-ES May 07 '21 at 12:58
  • I am sorry, I don't understand the question. Could you please rephrase it? – sophros May 07 '21 at 13:12
  • >"The downside being: if you fail mid-way in the pipeline, you have to start from scratch." - I'm not sure that's the case. The caching feature ensures that the pipeline can be started again and it will only execute unfinished steps. In fact the option 2 is significantly safer since it does not use mutable external state that can get corrupted and break everything. – Ark-kun Jun 09 '21 at 05:26
  • @Ark-kun: did you mean [Kubeflow Pipeline caching](https://www.kubeflow.org/docs/components/pipelines/caching/)? Indeed, it seems to provide steps caching and you may be right. – sophros Jun 09 '21 at 05:41
  • Yes, I meant the Kubeflow Pipelines' execution caching. – Ark-kun Jun 09 '21 at 07:59

> whether this is a best practice

The best practice is to use the file-based I/O and built-in data-passing features. The current implementation uploads the output data to storage in upstream components and downloads the data in downstream components. This is the safest and most portable option and should be used until you see that it no longer works for you (100GB datasets will probably not work reliably).
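As a rough sketch (assuming the KFP v1 SDK; the column name, base image, and parameter names are placeholders), a filter step using file-based I/O could look like this, with Kubeflow materializing the input artifact and storing the output artifact for you:

```python
from kfp.components import InputPath, OutputPath, create_component_from_func

def filter_dataframe(
    input_csv_path: InputPath("CSV"),
    output_csv_path: OutputPath("CSV"),
    min_value: float = 0.0,
):
    """Read the upstream data frame, apply a filter, write the result.

    KFP downloads the input file and stores the output file itself,
    so the component never talks to S3 directly.
    """
    import pandas as pd

    df = pd.read_csv(input_csv_path)
    df = df[df["value"] >= min_value]  # "value" is a placeholder column
    df.to_csv(output_csv_path, index=False)

filter_op = create_component_from_func(
    filter_dataframe,
    base_image="python:3.8",
    packages_to_install=["pandas"],
)
```

Downstream components then consume the step's output artifact (e.g. `filter_op(...).outputs["output_csv"]`) instead of reading from S3 themselves.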

> or whether it is better to share the data frame directly between components

How can you "directly share" in-memory python object between different python programs running in containers on different machines?

> since the upload to S3 can fail and then the whole pipeline fails

The failed pipeline can just be restarted. The caching feature will make sure that already finished tasks won't be re-executed.
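As an illustration (a minimal sketch assuming the KFP v1 SDK; the component and pipeline names are hypothetical), caching is on by default and can even be tuned per task:

```python
from kfp import dsl
from kfp.components import create_component_from_func

def add_one(x: int) -> int:
    return x + 1

add_one_op = create_component_from_func(add_one, base_image="python:3.8")

@dsl.pipeline(name="caching-demo")
def caching_demo():
    task = add_one_op(1)
    # Reuse a cached result up to 30 days old on re-runs;
    # set "P0D" to force re-execution of this step.
    task.execution_options.caching_strategy.max_cache_staleness = "P30D"
```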

Anyway, what is the alternative? How can you send the data between distributed containerized programs without sending it over the network?

Ark-kun