
The Kubernetes tutorial on communicating between containers in the same pod defines the following Pod YAML:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: two-containers
spec:

  restartPolicy: Never

  volumes:                      # <--- This is what I need
  - name: shared-data
    emptyDir: {}

  containers:

  - name: nginx-container
    image: nginx
    volumeMounts:
    - name: shared-data
      mountPath: /usr/share/nginx/html

  - name: debian-container
    image: debian
    volumeMounts:
    - name: shared-data
      mountPath: /pod-data
    command: ["/bin/sh"]
    args: ["-c", "echo Hello from the debian container > /pod-data/index.html"]

Note that the volumes key is defined under spec, so the volume is available to all of the pod's containers. I want to achieve the same behavior using kfp, the Python SDK for Kubeflow Pipelines.

However, with kfp I can only add volumes to individual containers, not to the whole workflow spec: kfp.dsl.ContainerOp.container.add_volume_mount can point to a previously created volume (kfp.dsl.PipelineVolume), but the volume then seems to be defined only within that container.

Here is what I have tried, but the volume always ends up defined in the first container rather than at the "global" level. How do I give op2 access to the volume? I would have expected this to live in kfp.dsl.PipelineConf, but volumes cannot be added to it. Is it just not implemented?

```python
import kubernetes as k8s
from kfp import compiler, dsl
from kubernetes.client import V1VolumeMount
import pprint

@dsl.pipeline(name="debug", description="Debug only pipeline")
def pipeline_func():
    op = dsl.ContainerOp(
            name='echo',
            image='library/bash:4.4.23',
            command=['sh', '-c'],
            arguments=['echo "[1,2,3]"> /tmp/output1.txt'],
            file_outputs={'output': '/tmp/output1.txt'})
    op2 = dsl.ContainerOp(
            name='echo2',
            image='library/bash:4.4.23',
            command=['sh', '-c'],
            arguments=['echo "[4,5,6]">> /tmp/output1.txt'],
            file_outputs={'output': '/tmp/output1.txt'})

    mount_folder = "/tmp"
    volume = dsl.PipelineVolume(volume=k8s.client.V1Volume(
            name=f"test-storage",
            empty_dir=k8s.client.V1EmptyDirVolumeSource()))
    op.add_pvolumes({mount_folder: volume})
    op2.container.add_volume_mount(volume_mount=V1VolumeMount(mount_path=mount_folder,
                                                              name=volume.name))
    op2.after(op)


workflow = compiler.Compiler().create_workflow(pipeline_func=pipeline_func)
pprint.pprint(workflow["spec"])
```

– RunOrVeith

1 Answer


You might want to check the difference between Kubernetes pods and containers. The Kubernetes example you've posted is a single pod with two containers; you can recreate it in KFP by adding a sidecar container to an instantiated ContainerOp. Your KFP code, by contrast, creates two single-container pods that, by design, do not see each other.
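
For reference, here is a rough sketch of the sidecar approach, assuming the KFP v1 SDK (dsl.Sidecar plus ContainerOp.add_sidecar); the images, names, and commands simply mirror the Kubernetes tutorial and are not the asker's actual workload:

```python
import kubernetes.client as k8s
from kfp import dsl


@dsl.pipeline(name="two-containers", description="Sidecar sketch")
def two_containers_pipeline():
    # One pipeline step = one pod with two containers (main + sidecar).
    op = dsl.ContainerOp(
        name='nginx',
        image='nginx',
        command=['sh', '-c'],
        # Give the sidecar a moment to write the file, then read it.
        arguments=['sleep 10; cat /usr/share/nginx/html/index.html'])

    # Pod-level volume, analogous to spec.volumes in the Pod YAML.
    op.add_volume(k8s.V1Volume(
        name='shared-data',
        empty_dir=k8s.V1EmptyDirVolumeSource()))

    # Mount it into the main container ...
    op.container.add_volume_mount(k8s.V1VolumeMount(
        name='shared-data',
        mount_path='/usr/share/nginx/html'))

    # ... and into a sidecar container in the same pod.
    sidecar = dsl.Sidecar(
        name='debian',
        image='debian',
        command=['/bin/sh', '-c'],
        args=['echo Hello from the debian container > /pod-data/index.html'])
    sidecar.add_volume_mount(k8s.V1VolumeMount(
        name='shared-data',
        mount_path='/pod-data'))
    op.add_sidecar(sidecar)
```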

To exchange data between pods you would need a real volume, not emptyDir, which only works for containers within a single pod.

```python
volume = dsl.PipelineVolume(volume=k8s.client.V1Volume(
        name=f"test-storage",
        empty_dir=k8s.client.V1EmptyDirVolumeSource()))
op.add_pvolumes({mount_folder: volume})
```

Please do not use dsl.PipelineVolume or op.add_pvolumes unless you know what they are and why you want them. Just use the normal op.add_volume and op.container.add_volume_mount.
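
Here is a minimal sketch of that approach applied to the question's two steps, assuming a pre-existing PersistentVolumeClaim; the claim name shared-pvc and the /data mount path are made up for this example:

```python
import kubernetes.client as k8s
from kfp import dsl


@dsl.pipeline(name="debug", description="Shared PVC sketch")
def pipeline_func():
    op = dsl.ContainerOp(
        name='echo',
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "[1,2,3]" > /data/output1.txt'])
    op2 = dsl.ContainerOp(
        name='echo2',
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "[4,5,6]" >> /data/output1.txt'])
    op2.after(op)

    # A "real" volume backed by a pre-existing PVC (claim name assumed here).
    volume = k8s.V1Volume(
        name='shared-storage',
        persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
            claim_name='shared-pvc'))

    for task in (op, op2):
        # Pod-level volume definition ...
        task.add_volume(volume)
        # ... plus the per-container mount.
        task.container.add_volume_mount(k8s.V1VolumeMount(
            name='shared-storage',
            mount_path='/data'))
```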

Nevertheless, is there a particular reason you need to use volumes? Volumes make pipelines and components non-portable. No 1st-party components use volumes.

The KFP team encourages users to use the normal data-passing methods instead (for both non-Python and Python components).
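
As an illustration of that style, here is a rough sketch using lightweight Python components (kfp.components.create_component_from_func with InputPath/OutputPath); all function and output names are made up for this example:

```python
from kfp import dsl
from kfp.components import InputPath, OutputPath, create_component_from_func


def produce_numbers(numbers_path: OutputPath(str)):
    # The output is written to a path provided by KFP, not to a shared volume.
    with open(numbers_path, 'w') as f:
        f.write('[1,2,3]')


def append_numbers(numbers_path: InputPath(str), result_path: OutputPath(str)):
    # The upstream output arrives as an input file, passed by the system.
    with open(numbers_path) as src, open(result_path, 'w') as dst:
        dst.write(src.read() + '\n[4,5,6]')


produce_op = create_component_from_func(produce_numbers, base_image='python:3.8')
append_op = create_component_from_func(append_numbers, base_image='python:3.8')


@dsl.pipeline(name='data-passing-sketch')
def data_passing_pipeline():
    produce_task = produce_op()
    # 'numbers' is the output name derived from the 'numbers_path' parameter.
    append_op(numbers=produce_task.outputs['numbers'])
```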

– Ark-kun
  • I had to use volumes as a workaround for [this issue](https://github.com/kubernetes/autoscaler/issues/1869). The code I am actually running is also in an init_container, and that sets up the environment for the subsequent components. I would have liked to share the resulting environment (e.g. file system) of the init container with the other components instead of having to run the init container for every component. I can't bake this environment into a docker image because it is dynamic – RunOrVeith Sep 09 '20 at 07:27
  • Ideally, the components should act like pure functions, with all data being passed to them as input arguments and not accessed out of band from some global storage. Is there a reason you cannot pass that "resulting environment" between components? – Ark-kun Sep 10 '20 at 05:18
  • It is setting up the environment in which the components run, e.g. cloning some git repos (from our internal package manager, so I have to do it manually) – RunOrVeith Sep 10 '20 at 07:09
  • I see. But is something stopping you from outputting that environment with cloned GIT repos and passing it to other components? See this example that passes GIT repos around: https://github.com/Ark-kun/kfp_samples/blob/master/2019-10%20Kubeflow%20summit/106%20-%20Creating%20components%20from%20command-line%20programs/106%20-%20Creating%20components%20from%20command-line%20programs.ipynb – Ark-kun Sep 22 '20 at 06:08
  • Just want to mention that although I'm steering people away from volumes, my answer has a volume-based solution proposal: create a Kubernetes volume and then mount it into both containers using op.add_volume and op.container.add_volume_mount. If you have a PVC in the cluster, you can do that more easily using the kfp.onprem.mount_pvc helper (see the sketch after this comment thread). – Ark-kun Sep 22 '20 at 06:10
  • @Ark-kun I found myself needing to use volumes because the regular way isn't working for me. I tried [the notebook you posted](https://github.com/Ark-kun/kfp_samples/blob/master/2019-10%20Kubeflow%20summit/106%20-%20Creating%20components%20from%20command-line%20programs/106%20-%20Creating%20components%20from%20command-line%20programs.ipynb) but it gives me `This step is in Error state with this message: failed to save outputs: read /tmp/outputs/Repo_dir/data: is a directory`. I'm on GKE, Kubernetes 1.19, Kubeflow 1.4.1, Kubeflow SDK 1.6, runtime=pns (had to switch from `docker` runtime). – Nader Ghanbari Jun 28 '21 at 06:00
  • Kubernetes 1.19 breaks Docker-based execution. You have several options: 1) downgrade Kubernetes to 1.18 (or create a new cluster), 2) switch Argo to the PNS executor, although it might have some compatibility issues, or 3) keep track of whether KFP has updated to Argo 3 with the newest emissary executor. In my opinion, KFP without data passing loses a lot of its value. – Ark-kun Jun 28 '21 at 06:49
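
For completeness, here is a small sketch of the kfp.onprem.mount_pvc approach mentioned in the comments, again assuming a pre-existing PVC named shared-pvc and a made-up /data mount path:

```python
from kfp import dsl, onprem


@dsl.pipeline(name="debug", description="mount_pvc sketch")
def pipeline_func():
    op = dsl.ContainerOp(name='echo', image='library/bash:4.4.23',
                         command=['sh', '-c'],
                         arguments=['echo "[1,2,3]" > /data/output1.txt'])
    op2 = dsl.ContainerOp(name='echo2', image='library/bash:4.4.23',
                          command=['sh', '-c'],
                          arguments=['cat /data/output1.txt'])
    op2.after(op)

    # mount_pvc returns an op transformer; apply it to every step
    # that should see the shared claim.
    mount = onprem.mount_pvc(pvc_name='shared-pvc',       # assumed PVC
                             volume_name='shared-storage',
                             volume_mount_path='/data')
    op.apply(mount)
    op2.apply(mount)
```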