
I am trying to call spark-submit using Airflow's SparkSubmitOperator, and I need to set SPARK_MAJOR_VERSION and HADOOP_USER_NAME before running spark-submit. Can anyone help me with this?

I am trying to run in YARN mode and have passed env_vars, but SPARK_MAJOR_VERSION is still not set.

INFO - [2019-03-11 21:07:03,525] {base_hook.py:83} INFO - Using connection to: id: spark_default. Host: yarn://XXXX, Port: 8088, Schema: None, Login: peddnade, Password: XXXXXXXX, extra: {u'queue': u'priority', u'namespace': u'default', u'spark-home': u'/usr/'}
[2019-03-11 21:07:03,526] {logging_mixin.py:95} INFO - [2019-03-11 21:07:03,526] {spark_submit_hook.py:283} INFO - Spark-Submit cmd: [u'/usr/bin/spark-submit', '--master', 'yarn:/XX:8088', '--conf', 'spark.dynamicAllocation.enabled=true', '--conf', 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1', '--conf', 'spark.app.name=RDM', '--conf', 'spark.yarn.queue=priority', '--conf', 'spark.shuffle.service.enabled=true', '--conf', 'spark.yarn.appMasterEnv.SPARK_MAJOR_VERSION=2', '--conf', 'spark.yarn.appMasterEnv.HADOOP_USER_NAME=ppeddnade', '--files', '/usr/hdp/current/spark-client/conf/hive-site.xml', '--jars', '/usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar', '--num-executors', '4', '--total-executor-cores', '4', '--executor-cores', '4', '--executor-memory', '5g', '--driver-memory', '10g', '--name', u'airflow-spark-example', '--class', 'com.hilton.eim.job.SubmitSparkJob', '--queue', u'priority', '/home/ppeddnade/XX.jar', u'XX']
[2019-03-11 21:07:03,542] {logging_mixin.py:95} INFO - [2019-03-11 21:07:03,542] {spark_submit_hook.py:415} INFO - Multiple versions of Spark are installed but SPARK_MAJOR_VERSION is not set
[2019-03-11 21:07:03,542] {logging_mixin.py:95} INFO - [2019-03-11 21:07:03,542] {spark_submit_hook.py:415} INFO - Spark1 will be picked by default

1 Answer


SparkSubmitOperator provides an env_vars parameter for setting your environment variables (also available in SparkSubmitHook):

:param env_vars: Environment variables for spark-submit. It supports yarn and k8s mode too. (templated)
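For your case, a minimal DAG-task sketch might look like the following (the `task_id` and the application path are illustrative; the jar path and user name are taken from the logs in your question):

```python
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

submit_job = SparkSubmitOperator(
    task_id='spark_submit_job',          # hypothetical task id
    conn_id='spark_default',
    application='/home/ppeddnade/XX.jar',
    env_vars={
        'SPARK_MAJOR_VERSION': '2',
        'HADOOP_USER_NAME': 'ppeddnade',
    },
)
```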


You can try to infer its usage from test_spark_submit_hook.py:

hook = SparkSubmitHook(conn_id='spark_standalone_cluster_client_mode',
                       env_vars={"bar": "foo"})
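Note what the hook does with env_vars in YARN mode, as visible in your own logs: each entry is translated into a `spark.yarn.appMasterEnv.*` `--conf` flag, which sets the variable on the YARN Application Master rather than in the local shell that runs `/usr/bin/spark-submit`. A sketch of that translation (my own illustration, not the hook's exact code):

```python
def yarn_env_conf(env_vars):
    """Translate an env_vars dict into spark-submit --conf arguments,
    mirroring the spark.yarn.appMasterEnv.* entries seen in the logs."""
    args = []
    for key, value in env_vars.items():
        args += ['--conf', 'spark.yarn.appMasterEnv.{}={}'.format(key, value)]
    return args

print(yarn_env_conf({'SPARK_MAJOR_VERSION': '2'}))
# ['--conf', 'spark.yarn.appMasterEnv.SPARK_MAJOR_VERSION=2']
```

This may explain the "SPARK_MAJOR_VERSION is not set" warning: the variable reaches the Application Master, but the local spark-submit wrapper script checks the local environment.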

Even though you haven't asked for it: if you want to perform spark-submit against a remote cluster, have a look at the available options.

  • I am trying to run in YARN mode, and I have passed env_vars, but it is still not connecting – Pradeep Mar 11 '19 at 21:09
  • **@Pradeep** could you please elaborate on `..still it is not connecting..` (*stack-trace / logs / screenshots*)? I suggest you add this information under an **UPDATE** header in the question itself – y2k-shubham Mar 11 '19 at 21:13