
I'm using the following Scala code (as a custom spark-submit wrapper) to submit a Spark application to a YARN cluster:

    val result = Seq(spark_submit_script_here).!!

All I have at the time of submission is spark-submit and the Spark application's jar (no SparkContext). I'd like to capture the applicationId from `result`, but it's empty.

I can see the applicationId and the rest of the YARN messages in my command-line output:

    INFO yarn.Client: Application report for application_1450268755662_0110

How can I read it within code and get the applicationId?

Jacek Laskowski
nish1013
  • Are you talking about `SparkContext.applicationId`? – Markon Jan 04 '16 at 09:50
  • I think that yarn.Client is somehow getting the SparkContext.applicationId - you could do the same. – Markon Jan 04 '16 at 09:53
  • 1
    Possible duplicate of [spark Yarn mode how to get applicationId from spark-submit](https://stackoverflow.com/questions/44209462/spark-yarn-mode-how-to-get-applicationid-from-spark-submit) – Rahul Sharma Jun 05 '17 at 18:09

4 Answers


As stated in Spark issue SPARK-5439, you could either use `SparkContext.applicationId` or parse the stderr output. Since you are wrapping the spark-submit command with your own script/object, you will need to read stderr and extract the application id from it.
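
Since `!!` only captures stdout while spark-submit writes its application report to stderr, the wrapper needs to capture stderr instead, e.g. with a `ProcessLogger`. A minimal Scala sketch, assuming a placeholder command (`sparkSubmitCmd` stands in for the real invocation):

    import scala.sys.process._

    // Placeholder for the actual spark-submit invocation
    val sparkSubmitCmd = Seq("spark-submit", "--master", "yarn", "--deploy-mode", "cluster", "app.jar")

    val stderrBuf = new StringBuilder
    val exitCode = sparkSubmitCmd ! ProcessLogger(
      out => (),                                  // the report goes to stderr, so stdout is ignored here
      err => stderrBuf.append(err).append('\n')
    )

    // First YARN application id found in the captured stderr, if any
    val applicationId = """application_\d+_\d+""".r.findFirstIn(stderrBuf.toString)

The regex is kept loose (`\d+_\d+`) so it does not assume a fixed number of digits in either part of the id.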

Markon

If you are submitting the job via Python, this is how you can get the YARN application id:

    import re
    import subprocess

    cmd_list = [{
        'cmd': '/usr/bin/spark-submit --name %s --master yarn --deploy-mode cluster '
               '--executor-memory %s --executor-cores %s --num-executors %s '
               '--class %s %s %s'
               % (
                   app_name,
                   config.SJ_EXECUTOR_MEMORY,
                   config.SJ_EXECUTOR_CORES,
                   config.SJ_NUM_OF_EXECUTORS,
                   config.PRODUCT_SNAPSHOT_SKU_PRESTO_CLASS,
                   config.SPARK_JAR_LOCATION,
                   config.SPARK_LOGGING_ENABLED
               ),
        'cwd': config.WORK_DIR
    }]

    for cmd_obj in cmd_list:
        # spark-submit logs the YARN application report to stderr, so capture it
        result = subprocess.run(cmd_obj['cmd'], shell=True, check=True,
                                cwd=cmd_obj['cwd'], stderr=subprocess.PIPE)
        cmd_output = result.stderr.decode("utf-8")

        # The application id looks like application_<13-digit timestamp>_<sequence>
        yarn_application_ids = re.findall(r"application_\d{13}_\d{4}", cmd_output)
        if yarn_application_ids:
            yarn_application_id = yarn_application_ids[0]
            yarn_command = "yarn logs -applicationId " + yarn_application_id
Rajiv

Use the Spark context to get the application id:

    sc.getConf.getAppId
    res7: String = application_1532296406128_16555
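
For reference (an addition to this answer), `SparkContext` also exposes the id directly, returning the same value:

    // Direct accessor; same value as sc.getConf.getAppId
    val appId: String = sc.applicationId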
mmopu

The regex `application_\d{13}_\d{4}` in Rajiv's answer is not correct.

Actually, the job sequence number will grow beyond 9999, so only a regex like `application_\d{13}_\d{4,}` keeps working.

And here is the Java code:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public static final Pattern APPLICATION_REGEX = Pattern.compile("application_\\d+_\\d{4,}");

    /**
     * Get the YARN application id list.
     * @param log   log content
     * @return app id list
     */
    public static List<String> getAppIds(String log) {
        List<String> appIds = new ArrayList<>();
        Matcher matcher = APPLICATION_REGEX.matcher(log);
        while (matcher.find()) {
            String appId = matcher.group();
            if (!appIds.contains(appId)) {
                appIds.add(appId);
            }
        }
        return appIds;
    }
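
For instance, a quick check (in Scala, matching the question's language; the id below is hypothetical) shows how the fixed-width pattern truncates once the sequence number passes four digits:

    // Hypothetical id whose sequence number has grown past 9999
    val log = "INFO yarn.Client: Application report for application_1450268755662_10110"

    """application_\d{13}_\d{4}""".r.findFirstIn(log)  // Some(application_1450268755662_1011) -- truncated
    """application_\d+_\d{4,}""".r.findFirstIn(log)    // Some(application_1450268755662_10110) -- full id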

geosmart