
In Apache Beam's Spark documentation, it says that you can specify --environment_type="DOCKER" to customize the runtime environment:

The Beam SDK runtime environment can be containerized with Docker to isolate it from other runtime systems. To learn more about the container environment, read the Beam SDK Harness container contract.

...

You may want to customize container images for many reasons, including:

  • Pre-installing additional dependencies
  • Launching third-party software in the worker environment
  • Further customizing the execution environment

...

# When running batch jobs locally, we need to reuse the container,
# hence the --environment_cache_millis option below.
python -m apache_beam.examples.wordcount \
  --input=/path/to/inputfile \
  --output=/path/to/write/counts \
  --runner=SparkRunner \
  --environment_cache_millis=10000 \
  --environment_type="DOCKER" \
  --environment_config="${IMAGE_URL}"
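
(For reference, my understanding is that ${IMAGE_URL} points at a custom SDK container, typically built on top of a released Beam SDK image roughly like the sketch below; the base image tag and the requirements file are just placeholders, not something from the docs:

FROM apache/beam_python3.9_sdk:2.41.0
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Keep the base image's entrypoint (the SDK harness boot binary) unchanged.

and then pushed somewhere the workers can pull it from, e.g. docker build -t "${IMAGE_URL}" . && docker push "${IMAGE_URL}".)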

If you submit this job to an existing Spark cluster, what does the Docker image do to the Spark cluster? Does it run all the Spark executors with that Docker image? What happens to the existing Spark executors, if there are any? What about the Spark driver? What mechanism (the Spark driver API?) is used to distribute the Docker image to the machines?

cozos

1 Answer


TL;DR: The answer to this question is in this picture of the Beam Portability Architecture (it's taken from my talk at Beam Summit 2020 about running cross-language pipelines on Beam).

For example, if you run your Beam pipeline with the Beam Portable Runner on a Spark cluster, the Beam Portable Spark Runner will translate your job into a normal Spark job and then submit/run it on an ordinary Spark cluster. So it will use the driver/executors of your Spark cluster, as usual.
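
One way to make that flow explicit (host names, ports and image tags below are just placeholders): start the Beam Spark job server pointed at your existing Spark master, then submit the portable pipeline to it:

docker run --net=host apache/beam_spark3_job_server:2.41.0 \
  --spark-master-url=spark://<your-spark-master>:7077

python -m apache_beam.examples.wordcount \
  --input=/path/to/inputfile \
  --output=/path/to/write/counts \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_type=DOCKER \
  --environment_config="${IMAGE_URL}"

The job server translates the pipeline and submits it to the Spark master like any other Spark application.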

As you can see from the picture, the Docker container is used just as part of the SDK Harness to execute DoFn code independently of the "main" language of your pipeline (for example, to run some Python code as part of a Java pipeline).

The only requirement, iirc, is that your Spark executor nodes must have Docker installed to run the Docker container(s). Also, you can pre-fetch the Beam SDK Docker images on the Spark executor nodes to avoid pulling them when the job runs for the first time.
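
For example, something along these lines on each executor node (the image name and tag are just an example; pick the SDK and Beam version matching your pipeline):

docker pull apache/beam_python3.9_sdk:2.41.0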

An alternative solution that Beam Portability provides for portable pipelines is to execute the SDK Harness as a normal system process. In this case, you need to specify environment_type="PROCESS" and provide a path to the executable file (which obviously has to be installed on all executor nodes).
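
A sketch of what that could look like, reusing the wordcount example from the question (the boot executable path below is an assumption; it is the SDK harness binary that has to be present on every executor node):

python -m apache_beam.examples.wordcount \
  --input=/path/to/inputfile \
  --output=/path/to/write/counts \
  --runner=SparkRunner \
  --environment_type=PROCESS \
  --environment_config='{"command": "/opt/apache/beam/boot"}'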

Alexey Romanenko
  • Thank you! I will watch your Apache Beam talk to understand more. Follow up question: How does the SDK Harness Docker container interact with Spark data processing? Where does the data processing actually happen - Spark Executors / RDDs or the Beam SDK Harness DoFns? How do the Spark executors and Beam SDK Harness communicate? – cozos Sep 10 '22 at 14:11
  • Do you have any suggestions where I can read up more on how Spark/PortableRunner works? This is the closest thing I can find https://docs.google.com/document/d/1j8GERTiHUuc6CzzCXZHc38rBn41uWfATBh2-5JN8hro/edit?usp=drivesdk – cozos Sep 10 '22 at 14:16
  • If you wish to get more details, then I'd suggest you take a look at this design doc about the Beam SDK Harness container contract: https://s.apache.org/beam-fn-api-container-contract There is also a bunch of design docs under the "Portability" section here that can be useful: https://cwiki.apache.org/confluence/display/BEAM/Design+Documents – Alexey Romanenko Sep 12 '22 at 16:10
  • Following up on your Spark-related questions: all data processing happens on Spark. The Beam Portable Runner just translates your Beam pipeline into a Spark pipeline (in the case of the Beam Portable Spark Runner), which then runs on your Spark cluster as usual. All the "magic" that involves SDK Harness calls happens "inside" the executed code, since your Beam code is wrapped with Beam Portable SDK code, so Spark has no knowledge of it. – Alexey Romanenko Sep 12 '22 at 16:23