
This question might seem like a duplicate of this.

I am trying to run an Apache Beam Python pipeline using Flink on an offline instance of Kubernetes. However, since I have user code with external dependencies, I am using the Python SDK harness as an external service - which is causing errors (described below).

The Kubernetes manifests I use to launch the Beam Python SDK worker pool:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: beam-sdk
spec:
  replicas: 1
  selector:
    matchLabels:
      app: beam
      component: python-beam-sdk
  template:
    metadata:
      labels:
        app: beam
        component: python-beam-sdk
    spec:
      hostNetwork: True
      containers:
      - name: python-beam-sdk
        image: apachebeam/python3.7_sdk:latest
        imagePullPolicy: "Never"
        command: ["/opt/apache/beam/boot", "--worker_pool"]
        ports:
        - containerPort: 50000
          name: yay

---
apiVersion: v1
kind: Service
metadata:
  name: beam-python-service
spec:
  type: NodePort
  ports:
  - name: yay
    port: 50000
    targetPort: 50000
  selector:
    app: beam
    component: python-beam-sdk

When I launch my pipeline with the following options:

beam_options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_version=1.9",
    "--flink_master=10.101.28.28:8081",
    "--environment_type=EXTERNAL",
    "--environment_config=10.97.176.105:50000",
    "--setup_file=./setup.py"
])
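Before submitting the job, it can help to confirm that the worker-pool endpoint given in `--environment_config` is actually reachable from wherever the pipeline is launched. A minimal sketch of such a check (the host and port below are just the ones from the options above; substitute your own Service IP):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Quick sanity check against the worker-pool Service from the manifest above:
# can_connect("10.97.176.105", 50000)
```

Note that even if this check succeeds, the job can still fail later for the reason described in the answer below: reaching port 50000 only proves the worker pool is up, not that the worker it spawns can reach back to the runner.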

I get the following error message (in the logs of the Python SDK worker pool pod):

NAME                                 READY   STATUS    RESTARTS   AGE
beam-sdk-666779599c-w65g5            1/1     Running   1          4d20h
flink-jobmanager-74d444cccf-m4g8k    1/1     Running   1          4d20h
flink-taskmanager-5487cc9bc9-fsbts   1/1     Running   2          4d20h
flink-taskmanager-5487cc9bc9-zmnv7   1/1     Running   2          4d20h
(base) [~]$ sudo kubectl logs -f beam-sdk-666779599c-w65g5                                                                                                                   
2020/02/26 07:56:44 Starting worker pool 1: python -m apache_beam.runners.worker.worker_pool_main --service_port=50000 --container_executable=/opt/apache/beam/boot
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1', '--logging_endpoint=localhost:39283', '--artifact_endpoint=localhost:41533', '--provision_endpoint=localhost:42233', '--control_endpoint=localhost:44977']
2020/02/26 09:09:07 Initializing python harness: /opt/apache/beam/boot --id=1-1 --logging_endpoint=localhost:39283 --artifact_endpoint=localhost:41533 --provision_endpoint=localhost:42233 --control_endpoint=localhost:44977
2020/02/26 09:11:07 Failed to obtain provisioning information: failed to dial server at localhost:42233
    caused by:
context deadline exceeded

I have no idea what the logging or artifact endpoints (etc.) are. From inspecting the source code, it seems that these endpoints are hard-coded to localhost.

user488446
  • did you figure it out? – CpILL Aug 19 '20 at 00:49
  • Unfortunately not. We're just using the local runner for our python pipelines, which works alright for our tasks. You might want to take a look at [this](https://stackoverflow.com/a/62881644/12933019) answer. – user488446 Aug 24 '20 at 09:16
  • @user488446 You start with referring to a question that might be a duplicate. That question has received answers afterwards so it would be good to check if these answer your question. If not please highlight the difference/gap. – Dennis Jaheruddin Mar 02 '21 at 23:59
  • @DennisJaheruddin Sorry. I just checked the answer to the referenced post. The answer to the referenced post is indeed a valid answer to my question. Should I reference the answer on this post or delete my post? What do you suggest? Still learning stackoverflow. – user488446 Mar 04 '21 at 22:46
  • @user488446 It depends, if you find out that both the answer and question are the same deleting makes sense. If the answer is the same but the question different, closing may be better (so others with your question can still find the answer). Or taking the relevant part and answering this question while referring to the source is also good. Now this has been done so no need for further steps :) – Dennis Jaheruddin Mar 14 '21 at 08:34

1 Answer


(You said in a comment that the answer to the referenced post is valid, so I'll just address the specific error you ran into in case someone else hits it.)

Your understanding is correct: the logging, artifact, and other endpoints are effectively hard-coded to use localhost. These endpoints are meant to be used only internally by Beam and are not configurable, so the Beam worker is implicitly assumed to run on the same host as the Flink task manager. Typically this is accomplished by running the Beam worker pool as a sidecar container in the Flink task manager pod, rather than as a separate service.
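A minimal sketch of that sidecar arrangement, assuming a Flink task manager Deployment exists (the container names and the `flink:1.9` image tag here are illustrative, not taken from the question; adapt them to your actual manifests):

```yaml
# Hypothetical task manager pod template with the Beam worker pool as a
# sidecar; both containers share the pod's network namespace, so the
# worker's hard-coded localhost endpoints resolve to the task manager.
spec:
  template:
    spec:
      containers:
      - name: flink-taskmanager
        image: flink:1.9          # assumed tag matching --flink_version=1.9
        args: ["taskmanager"]
      - name: beam-worker-pool
        image: apachebeam/python3.7_sdk:latest
        command: ["/opt/apache/beam/boot", "--worker_pool"]
        ports:
        - containerPort: 50000
```

With this layout, `--environment_config` would point at `localhost:50000` rather than at a separate Service IP.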

ibzib