I have a PySpark project whose application entry point is main.py. The ETL jobs need to be packaged into jobs.zip to be uploaded.
Some ETL jobs need to send email using a template.
Note that the template files can't be packed into jobs.zip, or they can't be read from within a Spark job.
My spark-submit command looks like this:
spark-submit \
    --driver-memory 2g \
    --executor-memory 2g \
    --num-executors 1 \
    --executor-cores 2 \
    --name report_v2 \
    --py-files jobs.zip,libs.zip \
    --files templates/accuracy_report_v2.md \
    main.py \
    --job report_v2
In Airflow, my dags folder structure is:
dc:~/airflow/dags$ tree
.
├── data_wrangling
│   ├── data_wrangling_dag_check.py
└── sf_dags
    ├── config.py
    ├── jobs.zip
    ├── libs.zip
    ├── main.py
    ├── sf_report.py
    ├── templates
    │   ├── accuracy_report.md
    │   └── accuracy_report_v2.md
    └── utils.py
I used to set template_path in this Spark job's config to just a relative path, templates/accuracy_report_v2.md.
This worked fine when I ran spark-submit xxxx directly from the /airflow/dags/sf_dags folder.
But Airflow complains that it cannot find the files given by relative paths; apparently Airflow doesn't execute spark-submit from the /airflow/dags/sf_dags folder.
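To illustrate why the relative path breaks, here is a minimal sketch. The helper name resolve and the two temporary directories are made up for illustration; they only stand in for the sf_dags folder and for whatever directory the submitting process starts in. A relative path is resolved against the current working directory of the process, not against the folder that holds main.py.

```python
import os
import tempfile


def resolve(relative_path):
    """Return the absolute path that a relative path maps to right now."""
    return os.path.abspath(relative_path)


original_cwd = os.getcwd()
# Hypothetical stand-ins: project_dir plays the role of sf_dags,
# other_dir plays the role of the directory the scheduler runs in.
project_dir = os.path.realpath(tempfile.mkdtemp())
other_dir = os.path.realpath(tempfile.mkdtemp())

os.chdir(project_dir)
from_project = resolve("templates/accuracy_report_v2.md")

os.chdir(other_dir)
from_other = resolve("templates/accuracy_report_v2.md")

os.chdir(original_cwd)  # restore the working directory

# Same relative path, two different absolute paths
print(from_project)
print(from_other)
```

Only the first result points inside the project folder, which matches what I see: the command works from sf_dags and fails elsewhere.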
So I have to use absolute paths, and the spark-submit command becomes:
spark-submit \
    --driver-memory 2g \
    --executor-memory 2g \
    --num-executors 1 \
    --executor-cores 2 \
    --name report_v2 \
    --py-files /home/dc/airflow/dags/sf_dags/jobs.zip,/home/dc/airflow/dags/sf_dags/libs.zip \
    --files /home/dc/airflow/dags/sf_dags/templates/accuracy_report_v2.md \
    /home/dc/airflow/dags/sf_dags/main.py \
    --job report_v2 \
    --job-args template_path=/home/dc/airflow/dags/sf_dags/templates/accuracy_report_v2.md
I have to add the extra argument --job-args template_path=/home/dc/airflow/dags/sf_dags/templates/accuracy_report_v2.md to make sure my Spark job doesn't fail when it is submitted by Airflow.
This argument is redundant, and I don't like it.
How do I make Airflow execute spark-submit from the /airflow/dags/sf_dags folder, so that I don't need to add an extra argument?
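For reference, this is roughly the behaviour I am after, written as a plain Python sketch rather than a confirmed Airflow solution. The names build_spark_submit and run_job are hypothetical; the flags are the ones from my command above. The idea is to keep the relative paths and set the working directory of the child process instead.

```python
import subprocess


def build_spark_submit(job_name, template="templates/accuracy_report_v2.md"):
    """Build the spark-submit argument list with the same relative paths as the manual run."""
    return [
        "spark-submit",
        "--driver-memory", "2g",
        "--executor-memory", "2g",
        "--num-executors", "1",
        "--executor-cores", "2",
        "--name", job_name,
        "--py-files", "jobs.zip,libs.zip",
        "--files", template,
        "main.py",
        "--job", job_name,
    ]


def run_job(job_name, project_dir):
    """Run spark-submit with project_dir as the working directory (hypothetical helper)."""
    # cwd= makes the child process resolve relative paths against project_dir,
    # so no absolute template_path argument is needed.
    return subprocess.run(build_spark_submit(job_name), cwd=project_dir, check=True)


# Nothing is executed here; only the command is built and shown.
command = build_spark_submit("report_v2")
print(" ".join(command))
```

Something like run_job("report_v2", "/home/dc/airflow/dags/sf_dags") could presumably be called from a Python callable in a DAG, and prefixing a shell command with cd /home/dc/airflow/dags/sf_dags && would be the same idea, but I have not verified either inside Airflow.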
PS:
- AIRFLOW_HOME must be /home/dc/airflow/dags/.
- I have many projects that need to be put in subfolders of AIRFLOW_HOME, which leads to this problem.