
I am currently running tasks through a Mesos stack that uses Docker containers.

Sometimes, some of the tasks fail.

Here are some of the related TaskStatus messages and reasons:

message: Container exited with status 1 - reason: REASON_COMMAND_EXECUTOR_FAILED
message: Container exited with status 42 - reason: REASON_COMMAND_EXECUTOR_FAILED
message: Container exited with status 137 - reason: REASON_COMMAND_EXECUTOR_FAILED

Is there a correspondence table that maps the container exit status codes in the TaskStatus message to more explicit errors?

Axel Borja

2 Answers


Command tasks can fail for several reasons, and each failure sets its own exit code. For example, Docker 1.10 sets exit status codes like this (from the documentation and this answer):

The exit code from docker run gives information about why the container failed to run or why it exited. When docker run exits with a non-zero code, the exit codes follow the chroot standard, see below:

125 if the error is with Docker daemon itself:

$ docker run --foo busybox; echo $?
# flag provided but not defined: --foo   See 'docker run --help'.   

126 if the contained command cannot be invoked:

$ docker run busybox /etc; echo $?
# docker: Error response from daemon: Container command '/etc' could not be invoked.   

127 if the contained command cannot be found

$ docker run busybox foo; echo $?
# docker: Error response from daemon: Container command 'foo' not found or does not exist.
# 127

Exit code of contained command otherwise

$ docker run busybox /bin/sh -c 'exit 3'; echo $?
# 3
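
The rules quoted above can be sketched as a small lookup. This is only an illustration: the helper name `explain_docker_run_exit` is hypothetical, and the sketch assumes the three reserved codes listed in the quoted documentation. Note that a contained command can itself exit with 125, 126 or 127, so the mapping is a hint rather than a proof.

```python
# Codes that `docker run` reserves for its own failures,
# as listed in the documentation quoted above.
DOCKER_RUN_CODES = {
    125: "error with the Docker daemon itself",
    126: "contained command cannot be invoked",
    127: "contained command cannot be found",
}


def explain_docker_run_exit(code: int) -> str:
    """Hypothetical helper: describe an exit code returned by `docker run`."""
    if code == 0:
        return "success"
    # Any other non-zero value is passed through from the contained command.
    return DOCKER_RUN_CODES.get(code, f"exit code {code} of the contained command")


if __name__ == "__main__":
    for code in (125, 126, 127, 3):
        print(code, "->", explain_docker_run_exit(code))
```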

Another set of exit code conventions can be found here (the shell's exit codes with special meanings):

| Code  |            Meaning             |         Example         |                                                   Comments                                                   |
|-------|--------------------------------|-------------------------|--------------------------------------------------------------------------------------------------------------|
| 1     | Catchall for general errors    | let "var1 = 1/0"        | Miscellaneous errors, such as "divide by zero" and other impermissible operations                            |
| 2     | Misuse of shell builtins       | empty_function() {}     | Missing keyword or command, or permission problem (and diff return code on a failed binary file comparison). |
| 126   | Command invoked cannot execute | /dev/null               | Permission problem or command is not an executable                                                           |
| 127   | "command not found"            | illegal_command         | Possible problem with $PATH or a typo                                                                        |
| 128   | Invalid argument to exit       | exit 3.14159            | exit takes only integer args in the range 0 - 255 (see first footnote)                                       |
| 128+n | Fatal error signal "n"         | kill -9 $PPID of script | $? returns 137 (128 + 9)                                                                                     |
| 130   | Script terminated by Control-C | Ctl-C                   | Control-C is fatal error signal 2, (130 = 128 + 2, see above)                                                |
| 255*  | Exit status out of range       | exit -1                 | exit takes only integer args in the range 0 - 255                                                            |
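
As a rough sketch of how the table can be applied, the function below looks up the reserved codes and decodes the `128+n` case into a signal name using Python's standard `signal` module. The function name `describe_exit_status` and the wording of the descriptions are assumptions made for illustration; signal names depend on the platform the code runs on.

```python
import signal

# Reserved codes, taken from the table above.
SPECIAL_CODES = {
    1: "catchall for general errors",
    2: "misuse of shell builtins",
    126: "command invoked cannot execute",
    127: "command not found",
    128: "invalid argument to exit",
    255: "exit status out of range",
}


def describe_exit_status(status: int) -> str:
    """Hypothetical helper: explain an exit status using the shell conventions."""
    if status == 0:
        return "success"
    if status in SPECIAL_CODES:
        return SPECIAL_CODES[status]
    if 128 < status < 255:
        signum = status - 128  # 128+n means "fatal error signal n"
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown on this platform"
        return f"terminated by signal {signum} ({name})"
    return f"application-defined exit code {status}"


if __name__ == "__main__":
    for status in (1, 42, 130, 137):
        print(status, "->", describe_exit_status(status))
```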

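According to your examples: status 1 is the catchall for general errors, 42 is not a reserved code and so was most likely set by the command itself, and 137 is 128 + 9, meaning the process was killed by signal 9.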

If you need more information to explain a status code, check the message field of the Mesos TaskStatus update; for example, Mesos puts information about OOM there. The same information can also be found in the Mesos logs. To debug why a command returned a non-zero code, check the files stored in the executor sandbox, especially stderr/stdout or command-specific logs.
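
To act on the status programmatically, the numeric code first has to be pulled out of the message text. A minimal sketch, assuming the message always follows the "Container exited with status N" wording shown in the question (the function name `exit_status_from_message` is made up for this example, and other Mesos versions or containerizers may word the message differently):

```python
import re
from typing import Optional

# Pattern based on the message wording shown in the question.
MESSAGE_RE = re.compile(r"Container exited with status (\d+)")


def exit_status_from_message(message: str) -> Optional[int]:
    """Hypothetical helper: extract the exit status from a TaskStatus message."""
    match = MESSAGE_RE.search(message)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    messages = [
        "Container exited with status 1",
        "Container exited with status 42",
        "Container exited with status 137",
    ]
    for message in messages:
        print(message, "->", exit_status_from_message(message))
```

Returning `None` for an unrecognized message keeps the caller from mistaking a parse failure for a real exit code.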

janisz
  • Unless I'm mistaken the above assumes that Mesos uses Docker to run tasks which may not be always true? – Jacek Laskowski May 06 '17 at 15:11
  • That's right. Docker is an optional containerizer for Mesos and is currently being replaced by the Mesos containerizer. Unfortunately, the question specifies neither the Mesos configuration nor the framework used. I assume it's Docker because the message indicates that a container exited, the external containerizer was deprecated, and the Mesos containerizer uses the `EXIT_FAILURE` (1) exit code. – janisz May 06 '17 at 15:33

I guess you want to review the Reason enum in mesos.proto (copied below):

  enum Reason {
    // TODO(jieyu): The default value when a caller doesn't check for
    // presence is 0 and so ideally the 0 reason is not a valid one.
    // Since this is not used anywhere, consider removing this reason.
    REASON_COMMAND_EXECUTOR_FAILED = 0;

    REASON_CONTAINER_LAUNCH_FAILED = 21;
    REASON_CONTAINER_LIMITATION = 19;
    REASON_CONTAINER_LIMITATION_DISK = 20;
    REASON_CONTAINER_LIMITATION_MEMORY = 8;
    REASON_CONTAINER_PREEMPTED = 17;
    REASON_CONTAINER_UPDATE_FAILED = 22;
    REASON_EXECUTOR_REGISTRATION_TIMEOUT = 23;
    REASON_EXECUTOR_REREGISTRATION_TIMEOUT = 24;
    REASON_EXECUTOR_TERMINATED = 1;
    REASON_EXECUTOR_UNREGISTERED = 2;
    REASON_FRAMEWORK_REMOVED = 3;
    REASON_GC_ERROR = 4;
    REASON_INVALID_FRAMEWORKID = 5;
    REASON_INVALID_OFFERS = 6;
    REASON_IO_SWITCHBOARD_EXITED = 27;
    REASON_MASTER_DISCONNECTED = 7;
    REASON_RECONCILIATION = 9;
    REASON_RESOURCES_UNKNOWN = 18;
    REASON_SLAVE_DISCONNECTED = 10;
    REASON_SLAVE_REMOVED = 11;
    REASON_SLAVE_RESTARTED = 12;
    REASON_SLAVE_UNKNOWN = 13;
    REASON_TASK_CHECK_STATUS_UPDATED = 28;
    REASON_TASK_GROUP_INVALID = 25;
    REASON_TASK_GROUP_UNAUTHORIZED = 26;
    REASON_TASK_INVALID = 14;
    REASON_TASK_UNAUTHORIZED = 15;
    REASON_TASK_UNKNOWN = 16;
  }
Jacek Laskowski
  • Hello Jacek, thanks for your answer, this list will clearly help me in the future. However, my question is much more about the container error status codes from the TaskStatus message. – Axel Borja May 05 '17 at 11:34