
We have existing production code that runs Spark jobs in parallel. We tried orchestrating some simple Spark jobs with Airflow and that worked, but now we are not sure how to handle the Spark jobs that run in parallel.

Can CeleryExecutor help in this case?

Or should we modify our existing Spark jobs so they do not run in parallel? Personally, I do not like the latter approach.

Our existing shell script submits Spark jobs in parallel, roughly like this, and we would like to run it from Airflow:

cat outfile.txt | parallel -k -j2 submitspark {} /data/list

Please suggest.

user3666197
Space X
  • You should check the Job Scheduling page in the Spark documentation; this can be done with the fair scheduler if YARN is being used in your case. https://spark.apache.org/docs/latest/job-scheduling.html From that page: "Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests" – oetzi Sep 09 '19 at 16:37
  • We need to orchestrate several Spark jobs using Airflow. My question is more about Airflow: how do we use Airflow when the existing shell script submits Spark jobs in parallel? – Space X Sep 09 '19 at 16:44
  • Irrespective of your `Airflow` deployment (`LocalExecutor` / `CeleryExecutor` / `KubernetesExecutor`), I'd suggest that memory-intensive jobs like `Spark`, `Hive`, should have a separate infrastructure; use `Airflow` as a **pure-orchestrator** (only submitting jobs to remote cluster and waiting for completion). Using this technique, we are running a DAG containing ~ 800 `Tez` / `MapReduce` / `Hive` tasks daily using `LocalExecutor`. For `Spark`, see [this](https://stackoverflow.com/a/54092691/3679900) and for on-demand infra, see [this](https://stackoverflow.com/a/55233359/3679900) – y2k-shubham Sep 10 '19 at 03:49
  • Thanks for both replies above. When should we use CeleryExecutor vs. LocalExecutor? – Space X Sep 10 '19 at 14:05
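
One way to express the `parallel -k -j2 submitspark {} /data/list` pattern in Airflow itself might be to create one `BashOperator` task per input line and cap concurrency with a pool. The following is only a sketch under assumptions: it assumes Airflow 2.x import paths, a pool named `spark_pool` with 2 slots that you have already created (the pool name, the DAG id, and the helper names `build_submit_commands` and `build_dag` are hypothetical), and that `submitspark` is on the worker's `PATH`. Tasks with no dependencies between them can run concurrently under `LocalExecutor` or `CeleryExecutor`, and the pool limits how many run at once, similar to `-j2`. There is no direct equivalent of `-k` (ordered output) here, since each task gets its own log.

```python
import shlex
from datetime import datetime


def build_submit_commands(items, target="/data/list"):
    """Return one `submitspark` shell command per non-empty item, in input order."""
    cleaned = [line.strip() for line in items]
    return [
        "submitspark {} {}".format(shlex.quote(item), shlex.quote(target))
        for item in cleaned
        if item
    ]


def build_dag(items, dag_id="parallel_spark_submit", pool="spark_pool"):
    """Build a DAG with one independent task per item (hypothetical helper).

    Assumes Airflow 2.x and an existing pool called `pool` with 2 slots.
    """
    # Imported lazily so build_submit_commands can be used without Airflow.
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    dag = DAG(dag_id=dag_id, start_date=datetime(2019, 9, 1), catchup=False)
    for index, command in enumerate(build_submit_commands(items)):
        # No dependencies are set, so the scheduler may run these tasks
        # concurrently; the pool caps how many run at the same time.
        BashOperator(
            task_id=f"submitspark_{index}",
            bash_command=command,
            pool=pool,
            dag=dag,
        )
    return dag
```

In a DAG file you would then assign the result at module level, for example `dag = build_dag(open("outfile.txt").read().splitlines())`, with the path adjusted to wherever `outfile.txt` lives on the scheduler. If changing the structure is not desirable, running the existing shell script unchanged from a single `BashOperator` should also work, at the cost of Airflow seeing all the Spark submissions as one task.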
