I am unable to set environment variables
for my spark application. I am using AWS EMR
to run a spark application. Which is more like a framework I wrote in python on top of spark, to run multiple spark jobs according to environment variables present. So in order for me to start the exact job, I need to pass the environment variable into the spark-submit
. I tried several methods to do this. But none of them works. As I try to print the value of the environment variable inside the application it returns empty.
To run the cluster in the EMR I am using following AWS CLI command
aws emr create-cluster --applications Name=Hadoop Name=Hive Name=Spark --ec2-attributes '{"KeyName":"<Key>","InstanceProfile":"<Profile>","SubnetId":"<Subnet-Id>","EmrManagedSlaveSecurityGroup":"<Group-Id>","EmrManagedMasterSecurityGroup":"<Group-Id>"}' --release-label emr-5.13.0 --log-uri 's3n://<bucket>/elasticmapreduce/' --bootstrap-action 'Path="s3://<bucket>/bootstrap.sh"' --steps file://./.envs/steps.json --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"c4.xlarge","Name":"Master"}]' --configurations file://./.envs/Production.json --ebs-root-volume-size 64 --service-role EMRRole --enable-debugging --name 'Application' --auto-terminate --scale-down-behavior TERMINATE_AT_TASK_COMPLETION --region <region>
Now Production.json
looks like this:
[
{
"Classification": "yarn-env",
"Properties": {},
"Configurations": [
{
"Classification": "export",
"Properties": {
"FOO": "bar"
}
}
]
},
{
"Classification": "spark-defaults",
"Properties": {
"spark.executor.memory": "2800m",
"spark.driver.memory": "900m"
}
}
]
And steps.json
like this :
[
{
"Name": "Job",
"Args": [
"--deploy-mode","cluster",
"--master","yarn","--py-files",
"s3://<bucket>/code/dependencies.zip",
"s3://<bucket>/code/__init__.py",
"--conf", "spark.yarn.appMasterEnv.SPARK_YARN_USER_ENV=SHAPE=TRIANGLE",
"--conf", "spark.yarn.appMasterEnv.SHAPE=RECTANGLE",
"--conf", "spark.executorEnv.SHAPE=SQUARE"
],
"ActionOnFailure": "CONTINUE",
"Type": "Spark"
}
]
When I try to access the environment variable inside my __init__.py
code, it simply prints empty. As you can see I am running the step using spark with yarn cluster
in cluster mode
. I went through these links to reach this position.
- How do I set an environment variable in a YARN Spark job?
- https://spark.apache.org/docs/latest/configuration.html#environment-variables
- https://spark.apache.org/docs/latest/configuration.html#runtime-environment
Thanks for any help.