I hope you can help me. I am trying to create an EMR cluster with Hadoop and Spark installed using AWS Data Pipeline. The problem is that the cluster is private, so it has no internet access to download anything. In the pipeline I specify bootstrap actions that download all the .jar files and dependencies, including TaskRunner.jar.
The pipeline activity is meant to launch script.py:
{
  "name": "DefaultEmrActivity1",
  "maximumRetries": 0,
  "runsOn": {
    "ref": "EmrClusterId_lKm9y"
  },
  "id": "EmrActivityId_SRjHg",
  "type": "ShellCommandActivity",
  "command": "spark-submit --deploy-mode cluster --conf spark.yarn.submit.waitAppCompletion=true --py-files s3://emr/script.py"
},
But this step never runs on my EMR cluster. Instead I see an "Install TaskRunner" step that tries to install the jars from the internet, and it fails.
TaskRunner step command:
JAR location: s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar
Main class: None
Arguments: s3://datapipeline-eu-west-1/eu-west-1/bootstrap-actions/latest/TaskRunner/install-remote-runner-v2
--workerGroup=df-08684532KKW88TTUXHVS_@EmrClusterId_lKm9y_2021-05-07T07:22:56
--endpoint=https://datapipeline.eu-west-1.amazonaws.com --region=eu-west-1
--logUri=s3://aws-logs-351516419540-eu-west-1/pipeline/df-08684532KKW88TTUXHVS/EmrClusterId_lKm9y/@EmrClusterId_lKm9y_2021-05-07T07:22:56/@EmrClusterId_lKm9y_2021-05-07T07:22:56_Attempt=1/ --taskRunnerId=54ec5b53-884b-420d-b3e6-d0e518ddf448
--zipFile=http://datapipeline-eu-west-1.s3.amazonaws.com/eu-west-1/software/latest/TaskRunner/TaskRunner-1.0.zip
--mysqlFile=http://datapipeline-eu-west-1.s3.amazonaws.com/eu-west-1/software/latest/TaskRunner/mysql-connector-java-bin.jar
--hiveCsvSerdeFile=http://datapipeline-eu-west-1.s3.amazonaws.com/eu-west-1/software/latest/TaskRunner/csv-serde.jar
--proxyHost= --proxyPort=-1 --username= --password= --windowsDomain= --windowsWorkgroup= --releaseLabel=emr-6.2.0
--jdbcDriverS3Path=s3://datapipeline-eu-west-1/eu-west-1/software/latest/TaskRunner/ --s3NoProxy=false
Action on failure: Terminate cluster
Error:
Connecting to datapipeline-eu-west-1.s3.amazonaws.com (datapipeline-eu-west-1.s3.amazonaws.com)|52.218.108.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16873 (16K) [application/octet-stream]
Saving to: ‘common/csv-serde.jar’
0K .......... ...... 100% 26.7M=0.001s
2021-05-07 07:30:44 (26.7 MB/s) - ‘common/csv-serde.jar’ saved [16873/16873]
+ '[' -n emr-6.2.0 ']'
+ sudo echo -e '\nexport HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/mnt/taskRunner/common/mysql-connector-java-bin.jar:/etc/hadoop/hive/lib/hive-exec.jar"'
+ sudo tee -a /etc/hadoop/conf/hadoop-env.sh
+ bash /etc/hadoop/conf/hadoop-env.sh
+ '[' -z emr-6.2.0 ']'
+ unzip -o taskRunner.zip
+ chmod 500 aws-datapipeline-taskrunner-v2.sh
+ '[' -d /usr/share/aws/emr/goodies/lib ']'
+ '[' -n emr-6.2.0 ']'
+ EMR_HADOOP_GOODIES_NAME='emr-hadoop-goodies-*jar'
+ EMR_HIVE_GOODIES_NAME='emr-hive-goodies-*jar'
+ OPEN_CSV_PATH=/usr/lib/hive/lib/
++ find /usr/share/aws/emr/goodies/lib -name 'emr-hadoop-goodies-*jar'
+ emr_goodies_jar=/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.1.0.jar
+ '[' -n /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.1.0.jar ']'
+ open_csv_symlink=/mnt/taskRunner/open-csv.jar
+ emr_goodies_symlink=/mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ emr_hive_goodies_symlink=/mnt/taskRunner/oncluster-emr-hive-goodies.jar
+ sudo rm -f /mnt/taskRunner/open-csv.jar
+ sudo rm -f /mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ sudo rm -f /mnt/taskRunner/oncluster-emr-hive-goodies.jar
++ find /usr/share/aws/emr/goodies/lib -name 'emr-hive-goodies-*jar'
+ emr_hive_jar=/usr/share/aws/emr/goodies/lib/emr-hive-goodies-3.1.0.jar
++ find /usr/lib/hive/lib/ -name 'opencsv*jar'
+ open_csv_jar='/usr/lib/hive/lib/opencsv-2.3.jar
/usr/lib/hive/lib/opencsv-3.9.jar'
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hadoop-goodies-3.1.0.jar /mnt/taskRunner/oncluster-emr-hadoop-goodies.jar
+ sudo ln -s /usr/share/aws/emr/goodies/lib/emr-hive-goodies-3.1.0.jar /mnt/taskRunner/oncluster-emr-hive-goodies.jar
+ sudo ln -s /usr/lib/hive/lib/opencsv-2.3.jar /usr/lib/hive/lib/opencsv-3.9.jar /mnt/taskRunner/open-csv.jar
ln: target ‘/mnt/taskRunner/open-csv.jar’ is not a directory
Command exiting with ret '1'
I can't confirm on the cluster itself why the link can't be created, because EMR terminates the cluster when the step fails.
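One reading of the trace, which I have not been able to verify on the cluster: the `find` for `opencsv*jar` returned two jars (opencsv-2.3.jar and opencsv-3.9.jar), so `ln -s` received two source paths, and in that form it requires the last argument to be an existing directory. Below is a minimal local sketch of that behaviour. The temp directory layout and the empty placeholder jar files are my own stand-ins, not the real EMR paths, and the variable names other than `open_csv_jar` are made up for illustration.

```shell
#!/usr/bin/env bash
# Local reproduction of the failing `ln -s` call seen in the step log.
# Everything lives in a throwaway temp directory; the jars are empty
# placeholder files (assumption), not the real Hive libraries.
set -u

workdir=$(mktemp -d)
mkdir -p "$workdir/hive-lib" "$workdir/taskRunner"
touch "$workdir/hive-lib/opencsv-2.3.jar" "$workdir/hive-lib/opencsv-3.9.jar"

# Same pattern as in the trace: find matches two jars, one path per line.
open_csv_jar=$(find "$workdir/hive-lib" -name 'opencsv*jar' | sort)
jar_count=$(printf '%s\n' "$open_csv_jar" | wc -l | tr -d ' ')
echo "jars found: $jar_count"

# The unquoted expansion splits into two source arguments, so ln expects
# the final argument to be an existing directory, which it is not.
ln_status=0
ln -s $open_csv_jar "$workdir/taskRunner/open-csv.jar" 2>/dev/null || ln_status=$?
echo "ln with two sources exited with: $ln_status"

# With a single source path the same call creates the symlink.
first_jar=$(printf '%s\n' "$open_csv_jar" | head -n 1)
single_status=0
ln -s "$first_jar" "$workdir/taskRunner/open-csv.jar" || single_status=$?
echo "ln with one source exited with: $single_status"

rm -rf "$workdir"
```

If this reading is right, the step would fail on any image that ships more than one opencsv jar under /usr/lib/hive/lib/, regardless of internet access.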
But I don't want this step to run at all, because these jars will be installed by the bootstrap actions. Any advice on how to prevent this step from running?
Thanks