I'm exploring Kubeflow as an option to deploy and connect the components of a typical ML pipeline. I'm using Docker containers as Kubeflow components, and so far I've been unable to successfully use ContainerOp's file_outputs argument to pass results between components.
Based on my understanding of the feature, creating and saving a file that is declared as one of the file_outputs of a component should cause it to persist and be readable by the following component.
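From the pipeline samples I've read, the declared output is then referenced in the downstream step via the op's outputs dict, roughly like this (a sketch; the --data flag on the preprocessor is my own assumption, not something my container currently accepts):

```python
import kfp.dsl as dsl

data_collector = dsl.ContainerOp(
    name='data collector',
    image='eu.gcr.io/kubeflow-demo-254012/data-collector',
    file_outputs={'output': '/output.txt'}
)

# Referencing outputs['output'] passes the content of /output.txt to the
# next step as an argument, and also makes the dependency between the two
# ops explicit, so an .after() call would no longer be needed.
data_preprocessor = dsl.ContainerOp(
    name='data preprocessor',
    image='eu.gcr.io/kubeflow-demo-254012/data-preprocessor',
    arguments=['--data', data_collector.outputs['output']]
)
```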
This is how I attempted to declare this in my pipeline Python code:
import kfp.dsl as dsl
import kfp.gcp as gcp

@dsl.pipeline(name='kubeflow demo')
def pipeline(project_id='kubeflow-demo-254012'):
    data_collector = dsl.ContainerOp(
        name='data collector',
        image='eu.gcr.io/kubeflow-demo-254012/data-collector',
        arguments=["--project_id", project_id],
        file_outputs={"output": '/output.txt'}
    )

    data_preprocessor = dsl.ContainerOp(
        name='data preprocessor',
        image='eu.gcr.io/kubeflow-demo-254012/data-preprocessor',
        arguments=["--project_id", project_id]
    )
    data_preprocessor.after(data_collector)
    # TODO: add other components

if __name__ == '__main__':
    import kfp.compiler as compiler
    compiler.Compiler().compile(pipeline, __file__ + '.tar.gz')
In the Python code for the data-collector.py component, I fetch the dataset and then write it to /output.txt. I'm able to read the file from within the same component, but not inside data-preprocessor.py, where I get a FileNotFoundError.
Is the use of file_outputs invalid for container-based Kubeflow components, or am I using it incorrectly in my code? If it's not an option in my case, is it possible to programmatically create Kubernetes volumes inside the pipeline declaration Python code and use them instead of file_outputs?
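For context, something along these lines is what I have in mind for the volume-based alternative (a sketch using dsl.VolumeOp from the kfp SDK; the /data mount path and resource names are placeholders I made up):

```python
import kfp.dsl as dsl

@dsl.pipeline(name='kubeflow demo with volume')
def pipeline_with_volume(project_id='kubeflow-demo-254012'):
    # Create a PVC when the pipeline runs
    vop = dsl.VolumeOp(
        name='create-volume',
        resource_name='shared-data',
        size='1Gi',
        modes=dsl.VOLUME_MODE_RWO
    )

    # Mount the volume in the first step and write /data/output.txt there
    data_collector = dsl.ContainerOp(
        name='data collector',
        image='eu.gcr.io/kubeflow-demo-254012/data-collector',
        arguments=["--project_id", project_id],
        pvolumes={'/data': vop.volume}
    )

    # Mount the same volume (after the first step) to read the file back
    data_preprocessor = dsl.ContainerOp(
        name='data preprocessor',
        image='eu.gcr.io/kubeflow-demo-254012/data-preprocessor',
        arguments=["--project_id", project_id],
        pvolumes={'/data': data_collector.pvolume}
    )
```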