
I'm trying to do a spark-submit from Airflow in cluster mode, and I want to specify log4j properties in the SparkSubmitOperator:

task_id='spark_submit_job',
conn_id='spark_default',
files='/usr/hdp/current/spark-client/conf/hive-site.xml',
jars='/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar',
java_class='com.xxx.eim.job.SubmitSparkJob',
application='/root/airflow/code/eimdataprocessor.jar',
total_executor_cores='4',
executor_cores='4',
executor_memory='5g',
num_executors='4',
name='airflow-spark-example',
verbose=False,
driver_memory='10g',
application_args=["XXX"],
conf={'master': 'yarn',
      'spark.yarn.queue': 'priority',
      'spark.app.name': 'XXX',
      'spark.dynamicAllocation.enabled': 'true',
      'spark.local.dir': '/opt/eim',
      'spark.shuffle.service.enabled': 'true',
      'spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored': 'true',
      'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version': '2'
      },
dag=dag)

1 Answer


I can think of two possible ways:

  1. Log4J properties as Spark configuration

    • Basically trying to achieve the same effect as this line in your spark-submit command: --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
    • Passed via the conf parameter of SparkSubmitOperator (see the first sketch after this list)
    • Do note that if you are writing a custom operator to perform spark-submit on a remote system, passing a Log4J configuration file could be challenging
  2. As Application-Args

    • Application args are appended as-is to the spark-submit command (after the application jar), so make sure you pass them with any -- prefix(es), if needed
    • Passed via application_args in SparkSubmitOperator (see the second sketch after this list)
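
Here's a minimal sketch of the first option, assuming the Airflow 1.x contrib import path and a placeholder path for your log4j.properties (adjust both to your setup): the file is shipped via files so YARN places it in the working directory of the driver and executor containers, and the extraJavaOptions then reference it by file name.

from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

log4j_opts = '-Dlog4j.configuration=log4j.properties'

spark_submit_with_conf = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',
    application='/root/airflow/code/eimdataprocessor.jar',
    java_class='com.xxx.eim.job.SubmitSparkJob',
    # ship the custom log4j file alongside hive-site.xml so it lands in each
    # container's working directory (the log4j path is a placeholder)
    files='/usr/hdp/current/spark-client/conf/hive-site.xml,/path/to/log4j.properties',
    conf={
        # same effect as:
        # --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties"
        'spark.executor.extraJavaOptions': log4j_opts,
        'spark.driver.extraJavaOptions': log4j_opts,
    },
    application_args=["XXX"],
    dag=dag)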
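
And a sketch of the second option: application_args are appended verbatim after the application jar, so they arrive as plain program arguments to your main class. The --log4j flag below is purely hypothetical; it only works if com.xxx.eim.job.SubmitSparkJob itself parses such an argument.

# same SparkSubmitOperator import as above
spark_submit_with_args = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',
    application='/root/airflow/code/eimdataprocessor.jar',
    java_class='com.xxx.eim.job.SubmitSparkJob',
    application_args=[
        "XXX",                                   # existing positional arg from your snippet
        "--log4j", "/path/to/log4j.properties",  # hypothetical flag; your main class must handle it
    ],
    dag=dag)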