7

I am trying to run 2 steps concurrently in EMR. However, the first step always runs while the second stays pending.

Part of my Yarn configuration is as follows:

{
  "Classification": "capacity-scheduler",
  "Properties": {
    "yarn.scheduler.capacity.resource-calculator": "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator",
    "yarn.scheduler.capacity.maximum-am-resource-percent": "0.5"
  }
}

On my local Mac I can run the 2 applications concurrently on YARN with a similar configuration; the only changes are the spark-submit resource requests, which I adjust to match the cluster capacity and the performance required.

In other words, my YARN is set up to run multiple applications.

Hence, before I dig deeper, I wonder whether steps can actually run concurrently or only serially.

Otherwise, are there any tips or anything specific I should do to run two jobs concurrently?

My cluster has far more capacity than each job requests, so I don't understand why they can't run concurrently.

MaatDeamon
  • It's possible to execute them in parallel by means of Spark itself (e.g. with sc.parallelize) if you know these two tasks in advance. But maybe that's not what you're looking for. – Mikhail Berlinkov Nov 04 '18 at 14:07
  • Thank you for your note. Would you elaborate on your answer, please, so I can figure out whether it serves my purpose? These are two applications, by the way: 2 spark-submit calls run as EMR steps. – MaatDeamon Nov 04 '18 at 14:09
  • What I meant is: if the code for these two tasks lives in one project, so that they can be run from the same place, then you can wrap them in one task with sc.parallelize to execute them in parallel. – Mikhail Berlinkov Nov 04 '18 at 14:11

5 Answers

5
  • Is it possible to have the steps run concurrently or only serially?

    • AWS support confirmed that multiple steps cannot run in parallel (concurrently); steps run serially, so what you are seeing (i.e. the second job in a pending state) is expected.
  • Are there any tips or anything specific to run two jobs concurrently?

    • You can simply put both spark-submit commands in a bash script and run that script as a single step. You might lose some direct debugging info on the AWS web console (which IMO is slow already), but you can still see that info on the Spark history server.

On your local Mac you can run multiple YARN applications in parallel because you submit them to YARN directly. In EMR, the YARN/Spark applications are submitted through AWS's internal `command-runner.jar`, which does a bunch of extra logging, bootstrapping, etc. so that the `emr step` info is visible on the web console.
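The bash-script workaround above could be sketched roughly as follows. This is a minimal sketch, not a tested EMR recipe: `run_pair` is a hypothetical helper name, and the class names and jar paths in the commented example are assumptions. It starts both commands in the background, waits for both, and returns non-zero if either one fails, so the single EMR step still reports a failure.

```shell
#!/bin/bash
# Hypothetical wrapper: launch two commands concurrently and report failure
# if either of them fails. On EMR, each command would be a full spark-submit.

run_pair() {
    local first="$1" second="$2"
    local rc=0

    # Start both commands in the background and remember their PIDs.
    bash -c "$first" &
    local pid1=$!
    bash -c "$second" &
    local pid2=$!

    # Wait for each one; remember if any of them exited non-zero.
    wait "$pid1" || rc=1
    wait "$pid2" || rc=1
    return "$rc"
}

# Example (assumed class names and jar paths), left commented out:
# run_pair \
#   "spark-submit --master yarn --deploy-mode cluster --class com.example.JobA /home/hadoop/a.jar" \
#   "spark-submit --master yarn --deploy-mode cluster --class com.example.JobB /home/hadoop/b.jar"
```

Both applications then show up as separate YARN applications, while the EMR console sees only one step.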

N_C
4

It looks like AWS finally implemented this feature in EMR 5.28.0!

The parameter is called "Concurrency" in the console wizard or StepConcurrencyLevel in the API:

Specifies the number of steps that can be executed concurrently. The default value is 1. The maximum value is 256.
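On an existing cluster, the same setting can be changed from the AWS CLI with `aws emr modify-cluster --step-concurrency-level`. The sketch below wraps that call in a hypothetical helper, `set_step_concurrency`, which checks the documented 1 to 256 range first. The `AWS_CLI` variable is an assumption added here so the command can be dry-run with `echo`, and the cluster id is a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: set the step concurrency level on a running cluster.
# AWS_CLI defaults to the real `aws` binary; set AWS_CLI=echo for a dry run.

set_step_concurrency() {
    local cluster_id="$1" level="$2"

    # Reject anything that is not a plain non-negative integer.
    case "$level" in
        ''|*[!0-9]*)
            echo "level must be an integer between 1 and 256" >&2
            return 2
            ;;
    esac

    # The documented range for StepConcurrencyLevel is 1 to 256.
    if [ "$level" -lt 1 ] || [ "$level" -gt 256 ]; then
        echo "level must be an integer between 1 and 256" >&2
        return 2
    fi

    "${AWS_CLI:-aws}" emr modify-cluster \
        --cluster-id "$cluster_id" \
        --step-concurrency-level "$level"
}

# Example (placeholder cluster id), left commented out:
# set_step_concurrency j-XXXXXXXXXXXXX 2
```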

Mariusz
2

There are 2 modes for running an application on YARN in AWS EMR:

  • Client
  • Cluster

If you use client mode, only one step will be in the running state at a given time. However, there is an option that lets you run more than one step concurrently.

Try submitting your step as shown below: `spark-submit --master yarn --deploy-mode cluster --executor-memory 1G --num-executors 2 --driver-memory 1g --executor-cores 2 --conf spark.yarn.submit.waitAppCompletion=false --class WordCount.word.App /home/hadoop/word.jar`

  1. Instead of letting AWS EMR define the memory allocation, try defining it yourself. Refer to: http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
  2. `spark.yarn.submit.waitAppCompletion=false`: in YARN cluster mode, controls whether the client waits to exit until the application completes. If set to true, the client process stays alive, reporting the application's status. Otherwise, the client process exits after submission.
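For illustration, two such steps could be queued through the AWS CLI's shorthand `--steps` syntax. The sketch below only builds the step specification string; `spark_step_spec` is a hypothetical helper, and the step names and cluster id in the commented example are placeholders. Since the client exits right after submission, the EMR step status will most likely reflect only whether the submission succeeded, so the real outcome has to be checked in YARN or the Spark history server.

```shell
#!/bin/bash
# Hypothetical helper: print an `aws emr add-steps` shorthand spec for a Spark
# step that returns as soon as the application has been submitted to YARN.

spark_step_spec() {
    local name="$1" class="$2" jar="$3"
    printf 'Type=Spark,Name=%s,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--conf,spark.yarn.submit.waitAppCompletion=false,--class,%s,%s]\n' \
        "$name" "$class" "$jar"
}

# Example (placeholder cluster id and step names), left commented out so
# that nothing is actually submitted:
# aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
#     "$(spark_step_spec WordCountA WordCount.word.App /home/hadoop/word.jar)" \
#     "$(spark_step_spec WordCountB WordCount.word.App /home/hadoop/word.jar)"
```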

Hope this is of help to you.

Jack
1

AWS now allows you to run steps concurrently in the later versions of EMR. https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/

One thing to watch when doing this is resources: your applications will compete for whatever is available, and one of them might sit in the ACCEPTED state without starting until the other finishes, which defeats the purpose.
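A rough way to check whether two applications fit side by side is to estimate the memory each one requests from YARN. The sketch below is a back-of-the-envelope calculation with hypothetical helper names. It assumes Spark's default memory overhead of 10% of the requested memory with a 384 MB minimum, assumes cluster deploy mode (so the driver is also a YARN container), and ignores YARN's rounding up to its minimum allocation size and any vcore limits.

```shell
#!/bin/bash
# Hypothetical helpers: estimate the YARN memory requested by a Spark app.

# Memory of one container: requested memory plus the assumed default
# overhead of max(384 MB, 10% of the requested memory).
container_mb() {
    local mem_mb="$1"
    local overhead=$(( mem_mb / 10 ))
    if [ "$overhead" -lt 384 ]; then
        overhead=384
    fi
    echo $(( mem_mb + overhead ))
}

# Total for one application in cluster mode: all executors plus the driver.
app_mb() {
    local num_executors="$1" executor_mb="$2" driver_mb="$3"
    local exec_total=$(( num_executors * $(container_mb "$executor_mb") ))
    echo $(( exec_total + $(container_mb "$driver_mb") ))
}

# 2 executors of 1 GB each plus a 1 GB driver:
app_mb 2 1024 1024   # prints 4224
```

If the sum over all concurrently running applications exceeds what YARN can allocate, the later application is likely to wait in the ACCEPTED state.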

0

You could always put the step in the background. It shouldn't be a problem if you handle logging and race conditions.

step-job.sh

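#!/bin/bash

# Placeholder for the real work of the step.
function main(){
    do_this
    do_that
}

if [[ "$1" == "1" ]]; then
    # Re-invoked with the marker argument "1": do the actual work.
    main
else
    # First invocation: relaunch this script in the background and return
    # immediately. "$@" is quoted so arguments with spaces stay intact.
    /bin/bash "$0" 1 "$@" &
fi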
Justin