
I am trying to run an EMR step (1 master and 2 core nodes) with a very simple Python script that I uploaded to S3 to be used in an EMR Spark application step. The script reads a data.txt file from S3 and writes it back, and it can be seen below:

from pyspark import SparkContext
import boto3

sc = SparkContext()
# Read data.txt from S3 and write it back to S3 as a single partition
text_file = sc.textFile('s3://First_bucket/data.txt')
text_file.repartition(1).saveAsTextFile('s3://First_bucket/logdata')
sc.stop()
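
A sketch of how such a script is added as a Spark step with the AWS CLI (the cluster ID and step name are placeholders; the arguments mirror the spark-submit command that appears in the step log further down):

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name="data.py step",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://First_bucket/data.py]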

However, this script only runs without error when import boto3 is removed; with it, the step fails. To fix this problem I tried adding a bootstrap action with a boto.sh file while creating my EMR cluster. The boto.sh file I used is as follows:

#!/bin/bash

sudo easy_install-3.6 pip
sudo pip install --target /usr/lib/spark/python/ boto3

Unfortunately, this only enabled the boto3 library on the master node, not on the core nodes. The EMR step still failed, and the error log file is:

2020-02-08T20:56:49.698Z INFO Ensure step 4 jar file command-runner.jar
2020-02-08T20:56:49.699Z INFO StepRunner: Created Runner for step 4
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster s3://First_bucket/data.py'
INFO Environment:
  PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
  LESS_TERMCAP_md=[01;38;5;208m
  LESS_TERMCAP_me=[0m
  HISTCONTROL=ignoredups
  LESS_TERMCAP_mb=[01;31m
  AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
  UPSTART_JOB=rc
  LESS_TERMCAP_se=[0m
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=eu-central-1
  AWS_ELB_HOME=/opt/aws/apitools/elb
  LESS_TERMCAP_us=[04;38;5;111m
  EC2_HOME=/opt/aws/apitools/ec2
  TERM=linux
  runlevel=3
  LANG=en_US.UTF-8
  AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
  MAIL=/var/spool/mail/hadoop
  LESS_TERMCAP_ue=[0m
  LOGNAME=hadoop
  PWD=/
  LANGSH_SOURCED=1
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-2V51S7I25TLLW/tmp
  _=/etc/alternatives/jre/bin/java
  CONSOLETYPE=serial
  RUNLEVEL=3
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  previous=N
  UPSTART_EVENTS=runlevel
  AWS_PATH=/opt/aws
  USER=hadoop
  UPSTART_INSTANCE=
  PREVLEVEL=N
  HADOOP_LOGFILE=syslog
  PYTHON_INSTALL_LAYOUT=amzn
  HOSTNAME=ip-***-***-***-***
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-2V51S7I25TLLW
  EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
  EMR_STEP_ID=s-2V51S7I25TLLW
  SHLVL=5
  HOME=/home/hadoop
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-2V51S7I25TLLW/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-2V51S7I25TLLW/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-2V51S7I25TLLW
INFO ProcessRunner started child process 22893
2020-02-08T20:56:49.705Z INFO HadoopJarStepRunner.Runner: startRun() called for s-2V51S7I25TLLW Child Pid: 22893
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 1 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 26 seconds
2020-02-08T20:57:15.787Z INFO Step created jobs: 
2020-02-08T20:57:15.787Z WARN Step failed with exitCode 1 and took 26 seconds

My question is: how do I use an EMR Spark application step with a Python script that imports libraries such as boto3? Thanks in advance.

Yousef

1 Answer


The answer is Bootstrap actions

By adding a bootstrap action[1] while creating the cluster, you can install the boto3 package on all nodes. Otherwise, for an already running cluster you would need to install boto3 on all nodes manually, either by connecting to each node or by using a tool such as Chef or Ansible.
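
For example, with the AWS CLI the bootstrap action can be attached at cluster creation, roughly like this (a sketch; the release label, instance type, bucket, and script name are placeholder values to adapt):

aws emr create-cluster \
  --name "spark-boto3-cluster" \
  --release-label emr-5.29.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://First_bucket/install-boto3.sh,Name="Install boto3"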

The bootstrap action script will contain something like:

sudo pip-3.6 install boto3 

Or

sudo pip install boto3 
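
Wrapped in a small script and uploaded to S3 (the file name install-boto3.sh is just a placeholder), the whole bootstrap file can be as minimal as this, assuming PySpark on the cluster uses Python 3.6:

#!/bin/bash
# Bootstrap action: install boto3 for the Python 3.6 interpreter used by PySpark
sudo pip-3.6 install boto3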

Note: Bootstrap actions run before Amazon EMR installs the applications that you specify when you create the cluster and before cluster nodes begin processing data.

The logs of the bootstrap action will be located in '/mnt/var/log/bootstrap-actions' on all nodes.
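
To check them on a given node, you can SSH in and inspect the log files; a sketch (each bootstrap action typically gets its own numbered sub-directory):

ls /mnt/var/log/bootstrap-actions/
sudo cat /mnt/var/log/bootstrap-actions/1/stderr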

[1]- Create Bootstrap Actions to Install Additional Software - https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html

  • I have read the document. It says that adding a bootstrap action while creating an EMR cluster applies it to all core nodes. However, when I do it, the .sh file is only applied on the master node (not on the core nodes). I am trying to fix this. Thank you – Yousef Feb 08 '20 at 23:37
  • Bootstrap actions should run on all nodes. Have you tried to SSH into the core nodes and check whether the package is installed using pip list? Finally, you can check /mnt/var/log/hadoop/steps/s-2V51S7I25TLLW/stdout and stderr. It might be another error – Abdelrahman Maharek Feb 09 '20 at 10:21
  • You can check the logs in '/mnt/var/log/bootstrap-actions' on the core nodes to make sure the script ran successfully there – Abdelrahman Maharek Feb 09 '20 at 10:48
  • Everything ran successfully according to the stdout and stderr of the bootstrap action I made. However, the Python script that uses import boto3 still does not work when it is used in an EMR spark-submit step. Any idea how to fix this problem? – Yousef Feb 09 '20 at 11:29
  • Okay, the problem is solved. Actually, my bash file was written wrongly. The correct one can be found here: https://stackoverflow.com/questions/31525012/how-to-bootstrap-installation-of-python-modules-on-amazon-emr – Yousef Feb 09 '20 at 11:36
  • sudo pip-3.6 install boto3 should also work. It depends on which Python version you are using in PySpark, so the command can also be 'sudo pip install boto3'. The best way to verify the bootstrap action is to run the command on all nodes over SSH and try the Spark command manually – Abdelrahman Maharek Feb 09 '20 at 11:49
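
As a follow-up to the last comments, a quick way to verify the package on every node is a small SSH loop (a sketch; the key file and node addresses are placeholders):

for node in ip-10-0-0-11 ip-10-0-0-12 ip-10-0-0-13; do
  ssh -i my-key.pem hadoop@$node "pip-3.6 list 2>/dev/null | grep -i boto3"
done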