I am unable to set environment variables for my Spark application. I am using AWS EMR to run a Spark application, which is really a framework I wrote in Python on top of Spark that runs different Spark jobs depending on which environment variables are set. So in order to start the right job, I need to pass an environment variable to spark-submit. I have tried several ways to do this, but none of them works: when I print the value of the environment variable inside the application, it comes back empty.

To create the cluster in EMR I am using the following AWS CLI command:

aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark --ec2-attributes '{"KeyName":"<Key>","InstanceProfile":"<Profile>","SubnetId":"<Subnet-Id>","EmrManagedSlaveSecurityGroup":"<Group-Id>","EmrManagedMasterSecurityGroup":"<Group-Id>"}' --release-label emr-5.13.0 --log-uri 's3n://<bucket>/elasticmapreduce/' --bootstrap-action 'Path="s3://<bucket>/bootstrap.sh"' --steps file://./.envs/steps.json  --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"c4.xlarge","Name":"Master"}]' --configurations file://./.envs/Production.json --ebs-root-volume-size 64 --service-role EMRRole --enable-debugging --name 'Application' --auto-terminate --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region <region>

Now Production.json looks like this:

[
  {
   "Classification": "yarn-env",
   "Properties": {},
   "Configurations": [
       {
         "Classification": "export",
         "Properties": {
             "FOO": "bar"
         }
       }
   ]
 },
 {
  "Classification": "spark-defaults",
  "Properties": {
    "spark.executor.memory": "2800m",
    "spark.driver.memory": "900m"
  }
 }
]
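
(For illustration, this is roughly how I check for FOO exported by the yarn-env classification above inside the application; a simplified sketch, not my actual framework code.)

import os

# Check whether FOO from the yarn-env export classification is visible
print("FOO =", os.environ.get("FOO"))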

And steps.json looks like this:

[
  {
    "Name": "Job",
    "Args": [
      "--deploy-mode","cluster",
      "--master","yarn","--py-files",
      "s3://<bucket>/code/dependencies.zip",
      "s3://<bucket>/code/__init__.py",
      "--conf", "spark.yarn.appMasterEnv.SPARK_YARN_USER_ENV=SHAPE=TRIANGLE",
      "--conf", "spark.yarn.appMasterEnv.SHAPE=RECTANGLE",
      "--conf", "spark.executorEnv.SHAPE=SQUARE"

    ],
    "ActionOnFailure": "CONTINUE",
    "Type": "Spark"
  }

]

When I try to access the environment variable inside my __init__.py code, it simply prints empty. As you can see, I am running the step with Spark on YARN in cluster mode. I went through these links to reach this point.
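
For reference, this is a minimal sketch of how such variables would normally be read with plain PySpark (not my actual framework code): spark.yarn.appMasterEnv.* should appear in the driver's environment in cluster mode, and spark.executorEnv.* in the executors' environment.

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("env-check").getOrCreate()

# Driver side: set via spark.yarn.appMasterEnv.SHAPE (cluster mode)
print("driver SHAPE =", os.environ.get("SHAPE"))

# Executor side: set via spark.executorEnv.SHAPE
print("executor SHAPE =",
      spark.sparkContext.parallelize([0])
           .map(lambda _: os.environ.get("SHAPE"))
           .collect())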

Thanks for any help.

Appunni M

2 Answers


Use the yarn-env classification to pass environment variables to the worker nodes.

Use the spark-env classification to pass environment variables to the driver when using deploy mode client. With deploy mode cluster, use yarn-env instead, since the driver then runs inside a YARN container.
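
For example, a driver-side (client mode) equivalent of the yarn-env block in the question would look like the following; this is just an illustration of the classification syntax:

[
  {
    "Classification": "spark-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "FOO": "bar"
        }
      }
    ]
  }
]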

(Dear moderator, if you want to delete the post, let me know why.)

rwitzel

To work with EMR clusters I use AWS Lambda, in a project that builds an EMR cluster whenever a certain flag is set. In that project we define placeholder variables, and the Lambda replaces each placeholder with its actual value. To do this you have to use the AWS API; the method to call is AWSSimpleSystemsManagement.getParameters. Then you build a map such as val parametersValues = parameterResult.getParameters.asScala.map(k => (k.getName, k.getValue)) to get (name, value) pairs and substitute the placeholders in the step definition.

E.g. ${BUCKET} = "s3://bucket-name/". This means you only have to write ${BUCKET} in your JSON instead of the full path.
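
A rough sketch of the same idea in Python with boto3 (the parameter name /emr/bucket and the file name steps.json are made up for illustration):

import json
import boto3

# Fetch placeholder values from SSM Parameter Store (hypothetical parameter name)
ssm = boto3.client("ssm")
result = ssm.get_parameters(Names=["/emr/bucket"], WithDecryption=True)
values = {p["Name"]: p["Value"] for p in result["Parameters"]}

# Replace ${BUCKET} in the step template with the value fetched from SSM
with open("steps.json") as f:
    template = f.read()
steps = json.loads(template.replace("${BUCKET}", values["/emr/bucket"]))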

Once you have replaced the values, the step JSON could look like this:

[
  {
    "Name": "Job",
    "Args": [
      "--deploy-mode","cluster",
      "--master","yarn","--py-files",
      "${BUCKET}/code/dependencies.zip",
      "${BUCKET}/code/__init__.py",
      "--conf", "spark.yarn.appMasterEnv.SPARK_YARN_USER_ENV=SHAPE=TRIANGLE",
      "--conf", "spark.yarn.appMasterEnv.SHAPE=RECTANGLE",
      "--conf", "spark.executorEnv.SHAPE=SQUARE"

    ],
    "ActionOnFailure": "CONTINUE",
    "Type": "Spark"
  }

]

I hope this can help you to solve your problem.

H. M.
  • That means having to create a bucket or path for every possible configuration, which seems like overkill for such a small thing. Instead I can just configure the path of the configuration dynamically through yarn-env, put all my configuration at that path as JSON, and load it using HDFS. – Appunni M Apr 20 '18 at 03:06
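
(A sketch of that alternative, assuming a hypothetical CONFIG_PATH variable exported via the yarn-env classification and a JSON file stored at that path:)

import json
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# CONFIG_PATH is illustrative; it would be exported via the yarn-env classification
config_path = os.environ["CONFIG_PATH"]

# Load the JSON configuration from HDFS (or S3) through Spark
config = json.loads("\n".join(spark.sparkContext.textFile(config_path).collect()))
print(config)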