Apache Beam's Spark runner documentation says that you can specify --environment_type="DOCKER" to customize the runtime environment:
The Beam SDK runtime environment can be containerized with Docker to isolate it from other runtime systems. To learn more about the container environment, read the Beam SDK Harness container contract.
...
You may want to customize container images for many reasons, including:
- Pre-installing additional dependencies
- Launching third-party software in the worker environment
- Further customizing the execution environment
...
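For example, to pre-install extra dependencies, my understanding is that you build a custom SDK container on top of the released Beam Python SDK image and push it to a registry the workers can pull from. This is only a rough sketch; the base image tag, the pandas dependency, and the registry name are placeholders I chose, not values from the docs:

cat > Dockerfile <<'EOF'
FROM apache/beam_python3.9_sdk:2.48.0
RUN pip install --no-cache-dir pandas
EOF

docker build -t my-registry.example.com/my-beam-sdk:2.48.0 .
docker push my-registry.example.com/my-beam-sdk:2.48.0

# This is the value I would pass as --environment_config below.
export IMAGE_URL=my-registry.example.com/my-beam-sdk:2.48.0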
# When running batch jobs locally, we need to reuse the container
# (hence --environment_cache_millis below).
python -m apache_beam.examples.wordcount \
  --input=/path/to/inputfile \
  --output=path/to/write/counts \
  --runner=SparkRunner \
  --environment_cache_millis=10000 \
  --environment_type="DOCKER" \
  --environment_config="${IMAGE_URL}"
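For context, my understanding is that running against an existing cluster means starting the Beam Spark job server against that cluster's master and then submitting the pipeline through it with --runner=PortableRunner and --job_endpoint. A sketch of what I mean (the master URL, version tag, and port are assumptions on my part, not from the docs):

docker run --net=host apache/beam_spark_job_server:2.48.0 \
  --spark-master-url=spark://spark-master.example.com:7077

python -m apache_beam.examples.wordcount \
  --input=/path/to/inputfile \
  --output=path/to/write/counts \
  --runner=PortableRunner \
  --job_endpoint=localhost:8099 \
  --environment_type="DOCKER" \
  --environment_config="${IMAGE_URL}"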
If you submit this job to an existing Spark cluster, what does the Docker image actually do to that cluster?
- Does it run all the Spark executors with that Docker image?
- What happens to any existing Spark executors?
- What about the Spark driver?
- What mechanism (the Spark driver API?) is used to distribute the Docker image to the machines?