
I have a PySpark project. The application entry point is main.py, and the ETL jobs need to be packaged inside jobs.zip for upload. Some ETL jobs need to send an email based on a template. Note that the template files cannot be packed into jobs.zip, otherwise they can't be read inside a Spark job.

My spark-submit command looks like this:

```
spark-submit \
  --driver-memory 2g \
  --executor-memory 2g \
  --num-executors 1 \
  --executor-cores 2 \
  --name report_v2 \
  --py-files jobs.zip,libs.zip \
  --files templates/accuracy_report_v2.md \
  main.py \
  --job report_v2
```
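
For context, main.py dispatches into jobs.zip, and the job reads the template from a template_path taken from the config (or from --job-args). Roughly, it does something like this — a simplified sketch with assumed names, not the real project code:

```python
# main.py (simplified sketch – the argument names match the spark-submit call
# above; everything else is assumed for illustration)
import argparse

from pyspark.sql import SparkSession


def run_report_v2(spark, template_path):
    # The template is an ordinary text file, read on the driver
    # relative to wherever spark-submit was executed.
    with open(template_path) as f:
        template = f.read()
    # ... build the report, fill the template, send the email ...


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--job", required=True)
    parser.add_argument("--job-args", nargs="*", default=[])
    args = parser.parse_args()

    # key=value overrides on top of the per-job config
    overrides = dict(kv.split("=", 1) for kv in args.job_args)
    template_path = overrides.get("template_path", "templates/accuracy_report_v2.md")

    spark = SparkSession.builder.appName(args.job).getOrCreate()
    if args.job == "report_v2":
        run_report_v2(spark, template_path)
```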

In Airflow, my dags folder structure is:

```
dc:~/airflow/dags$ tree
.
├── data_wrangling
│   └── data_wrangling_dag_check.py
└── sf_dags
    ├── config.py
    ├── jobs.zip
    ├── libs.zip
    ├── main.py
    ├── sf_report.py
    ├── templates
    │   ├── accuracy_report.md
    │   └── accuracy_report_v2.md
    └── utils.py
```

I used to set template_path in this Spark job's config to just a relative path, templates/accuracy_report_v2.md. That worked fine when I ran spark-submit xxxx directly under the /airflow/dags/sf_dags folder.
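
In config.py that is essentially just something like this (simplified; the dict name and structure are assumed for illustration):

```python
# config.py (simplified sketch)
JOB_CONFIG = {
    "report_v2": {
        # relative to the directory spark-submit is run from
        "template_path": "templates/accuracy_report_v2.md",
    },
}
```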

But Airflow complains that it **cannot find the relative-path files**; apparently Airflow does not execute spark-submit under the /airflow/dags/sf_dags folder. So I have to use absolute paths, and consequently the spark-submit command looks like this:

```
spark-submit \
  --driver-memory 2g \
  --executor-memory 2g \
  --num-executors 1 \
  --executor-cores 2 \
  --name report_v2 \
  --py-files /home/dc/airflow/dags/sf_dags/jobs.zip,/home/dc/airflow/dags/sf_dags/libs.zip \
  --files /home/dc/airflow/dags/sf_dags/templates/accuracy_report_v2.md \
  /home/dc/airflow/dags/sf_dags/main.py \
  --job report_v2 \
  --job-args template_path=/home/dc/airflow/dags/sf_dags/templates/accuracy_report_v2.md
```

I have to add an extra argument, --job-args template_path=/home/dc/airflow/dags/sf_dags/templates/accuracy_report_v2.md, to make sure my Spark job doesn't fail when it is submitted by Airflow. This argument is redundant, and I don't like it.
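
(I'm actually submitting through SparkSubmitOperator rather than a bash command; the bash form above is just for readability. The task ends up looking roughly like this — a sketch, with the DAG id, schedule and start_date assumed:)

```python
# sf_report.py (sketch – DAG details are assumed for illustration)
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

SF_DAGS_DIR = "/home/dc/airflow/dags/sf_dags"

dag = DAG("sf_report", start_date=datetime(2019, 9, 1), schedule_interval=None)

report_v2 = SparkSubmitOperator(
    task_id="report_v2",
    name="report_v2",
    application=SF_DAGS_DIR + "/main.py",
    py_files=SF_DAGS_DIR + "/jobs.zip," + SF_DAGS_DIR + "/libs.zip",
    files=SF_DAGS_DIR + "/templates/accuracy_report_v2.md",
    driver_memory="2g",
    executor_memory="2g",
    num_executors=1,
    executor_cores=2,
    application_args=[
        "--job", "report_v2",
        # the redundant argument I want to get rid of:
        "--job-args", "template_path=" + SF_DAGS_DIR + "/templates/accuracy_report_v2.md",
    ],
    dag=dag,
)
```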

How do I make Airflow execute spark-submit under the /airflow/dags/sf_dags folder, so that I don't need to add the extra argument?

PS:

  • AIRFLOW_HOME must be /home/dc/airflow/dags/
  • And I have many projects that need to be put in AIRFLOW_HOME subfolders, which leads to this problem.
  • I see that [`SparkSubmitHook`](https://github.com/apache/airflow/blob/v1-10-stable/airflow/contrib/hooks/spark_submit_hook.py) contains several params that might come handy for you, namely [`driver_class_path`](https://github.com/apache/airflow/blob/v1-10-stable/airflow/contrib/hooks/spark_submit_hook.py#L254), [`spark_home`](https://github.com/apache/airflow/blob/v1-10-stable/airflow/contrib/hooks/spark_submit_hook.py#L189) and [`env_vars`](https://github.com/apache/airflow/blob/v1-10-stable/airflow/contrib/hooks/spark_submit_hook.py#L231) – y2k-shubham Sep 06 '19 at 04:08
  • For alternate ways to use `Spark` in tandem with `Airflow`, see [this](https://stackoverflow.com/a/54092691/3679900) – y2k-shubham Sep 06 '19 at 04:11
  • @y2k-shubham I posted the bash command for easier understanding; actually I am using `SparkSubmitOperator`. That doesn't change the question. – Mithril Sep 06 '19 at 06:07
  • **@Mithril** if you look at the [source code](https://github.com/apache/airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py#L155), `SparkSubmitOperator` employs the `SparkSubmitHook`. This is true for almost all `operator`s / `sensor`s available in `Airflow`; they invoke some `hook`, and `hook`s are just wrappers over core functionality (like utility modules) – y2k-shubham Sep 06 '19 at 06:11
  • @y2k-shubham Sorry, I don't get your point. I understand the source code. As far as I know, there is no parameter that can achieve my goal, which is why I asked this question. – Mithril Sep 06 '19 at 07:42
  • @y2k-shubham Maybe I didn't describe it well: there is no problem with spark-submit in bash or in Airflow. I just want the spark-submit parameters to be the same in bash and in Airflow, all using relative paths. – Mithril Sep 06 '19 at 07:48
