
Amazon EMR Documentation to add steps to cluster says that a single Elastic MapReduce step can submit several jobs to Hadoop. However, Amazon EMR Documentation for Step configuration suggests that a single step can accommodate just one execution of hadoop-streaming.jar (that is, HadoopJarStep is a HadoopJarStepConfig rather than an array of HadoopJarStepConfigs).

What is the proper syntax for submitting several jobs to Hadoop in a step?
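For reference, the StepConfig shape the second document describes looks roughly like this (the jar path and arguments below are illustrative placeholders, not values from the docs):

```json
{
  "Name": "Streaming step",
  "ActionOnFailure": "CONTINUE",
  "HadoopJarStep": {
    "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
    "Args": ["-input", "s3://my-bucket/input",
             "-output", "s3://my-bucket/output",
             "-mapper", "mapper.py",
             "-reducer", "reducer.py"]
  }
}
```

Note that `HadoopJarStep` is a single object, not an array, so one step describes exactly one jar invocation.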

Harman
verve
  • Can you specify which API (language) you want to use to submit the job? I mean, in which language do you want to write the code that submits the EMR job to the cluster? – hayat Nov 13 '14 at 06:20
  • There's a JSON object describing your job flow that's read by EMR no matter what, so the language you originally use to describe your job flow doesn't matter -- it gets translated to JSON by, say, the AWS CLI according to some spec. I actually don't think this spec explicitly accommodates submitting multiple jobs to Hadoop in one step, but you can probably use script_runner.jar to do it: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html . Someone who explores this more deeply can write up an answer. I'll accept a good one. – verve Oct 04 '15 at 16:36
  • Did you get a solution to this? I am still looking on how to do this using AWS SDK. I am using Javascript APIs. – krackoder Feb 04 '16 at 23:46
  • When I try to run multiple hadoop jobs in EMR cluster, they all run one after the other (I can see the progress using yarn application -list). Is there a way to run all these hadoop jobs in parallel? Will passing them multiple hadoop jobs in a single step solve this issue? How to pass multiple jobs within a single step? – abstractKarshit May 25 '16 at 20:52
  • @Karshit Let me know if the answer I just wrote up works for you.... – verve May 26 '16 at 03:18

1 Answer


As the Amazon EMR documentation says, you can create a cluster that runs some script my_script.sh on the master instance as a step:

aws emr create-cluster --name "Test cluster" --ami-version 3.11 --use-default-roles \
    --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 \
    --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/my_script.sh"]
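Equivalently, the step definition can live in a JSON file and be passed as `--steps file://./steps.json` (a sketch; the bucket and script names here mirror the placeholders above):

```json
[
  {
    "Type": "CUSTOM_JAR",
    "Name": "CustomJAR",
    "ActionOnFailure": "CONTINUE",
    "Jar": "s3://elasticmapreduce/libs/script-runner/script-runner.jar",
    "Args": ["s3://mybucket/script-path/my_script.sh"]
  }
]
```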

my_script.sh should look something like this:

#!/usr/bin/env bash

hadoop jar my_first_step.jar [mainClass] args... &
hadoop jar my_second_step.jar [mainClass] args... &
# ... more jobs ...
wait

This way, multiple jobs are submitted to Hadoop in the same step, but unfortunately the EMR interface won't be able to track them. To track them, use the Hadoop web interfaces as shown here, or simply ssh to the master instance and explore with mapred job.
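The backgrounding-plus-wait pattern above is plain shell job control; here is a minimal runnable sketch with `sleep` standing in for the `hadoop jar` invocations:

```shell
#!/usr/bin/env bash
# Two placeholder "jobs" run in the background; `wait` blocks until both finish.
start=$(date +%s)
sleep 1 &   # stands in for: hadoop jar my_first_step.jar [mainClass] args... &
sleep 1 &   # stands in for: hadoop jar my_second_step.jar [mainClass] args... &
wait
end=$(date +%s)
echo "both jobs done after $((end - start))s"   # roughly 1s, since the jobs overlap
```

Because both commands are backgrounded before `wait`, the jobs run concurrently; the total wall time is governed by the slowest job, not the sum of all of them.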

verve
  • It's like logging into the cluster and running two Hadoop jobs (not as a step, but with the command "hadoop jar ..."). Here too, what happens in the EMR cluster is that one of the two jobs progresses while the other keeps waiting at 0% progress. – abstractKarshit May 30 '16 at 14:04
  • 1
    @Karshit Experiment with Fair Scheduler, which is appropriate for distributing resources evenly across running jobs: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html . This requires that you edit yarn-site.xml, which the docs at http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html tell you how to do. – verve May 31 '16 at 00:25
  • I will try this and let you know. – abstractKarshit Jun 01 '16 at 13:06
  • 1
    Yes fair scheduler works. Two jobs showed progress together. Thanks – abstractKarshit Jun 23 '16 at 09:56
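For the Fair Scheduler switch mentioned in the comments, a configuration sketch for the EMR configure-apps mechanism might look like this (the scheduler class is the standard YARN FairScheduler; verify the classification and class name against your EMR release):

```json
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.resourcemanager.scheduler.class":
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
    }
  }
]
```

This file would be passed to `aws emr create-cluster` via `--configurations file://./config.json` so that yarn-site.xml is edited at cluster creation time rather than by hand.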