
What I did

  1. Started services on an AlmaLinux server with docker-compose up
  2. Noticed output of docker-compose logs wasn't changing for a while
  3. Checked docker-compose ps (see also the summary sketch after the output below)
$ docker-compose ps
              Name                            Command                State     Ports
------------------------------------------------------------------------------------
mysupercoolsystem_api_1           python -m mysupercoolsyste ...   Exit 137
mysupercoolsystem_dev_1           sh -c jupyter lab --ip=0.0 ...   Exit 137
mysupercoolsystem_loader_1        /bin/sh -c python -m mysup ...   Exit 137
mysupercoolsystem_predictor_1     /bin/sh -c python -m mysup ...   Exit 137
mysupercoolsystem_trainer_1       /bin/sh -c python -m mysup ...   Exit 137


$ docker ps -a  # just to confirm
72708f3450   hub.nic.dk/nicecompany/mysupercoolsystem   "/bin/sh -c 'python …"   2 days ago    Exited (137) 2 days ago              mysupercoolsystem_trainer_1
3e286cabb0   jupyter/scipy-notebook:33add21fab64        "sh -c 'jupyter lab …"   2 days ago    Exited (137) 2 days ago              mysupercoolsystem_dev_1
246b87f0ac   hub.nic.dk/nicecompany/mysupercoolsystem   "/bin/sh -c 'python …"   2 days ago    Exited (137) 2 days ago              mysupercoolsystem_predictor_1
7d3297092c   hub.nic.dk/nicecompany/mysupercoolsystem   "python -m mysuperc …"   2 days ago    Exited (137) 2 days ago              mysupercoolsystem_api_1
2a07851f9c   hub.nic.dk/nicecompany/mysupercoolsystem   "/bin/sh -c 'python …"   2 days ago    Exited (137) 2 days ago              mysupercoolsystem_loader_1
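For a one-line summary of how each container finished, the exit code, OOM flag, and finish time can be pulled with Docker's Go-template formatting. A sketch; `$(docker ps -aq)` assumes only this project's containers are on the host:

$ # Exit code, OOM flag and finish time for every container on the host
$ docker inspect --format '{{.Name}}: exit={{.State.ExitCode}} oom={{.State.OOMKilled}} finished={{.State.FinishedAt}}' $(docker ps -aq)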

  4. Researched whether the containers were stopped because of out-of-memory (a sketch for checking the host kernel log follows the State snippet below)
    • Checked the virtual host: the Docker containers run on a single virtual (vCenter-managed) host. The host is allocated 20 GB of RAM, and vCenter monitoring shows RAM usage peaking at ca. 8 GB and never more.
    • Follow-up: talked to the sysadmin: the server was not restarted, and no processes were explicitly asked to terminate.
    • docker info | grep Memory returns Total Memory: 19.37GiB
    • Checked each container: docker inspect <container_id> gives the same "State" for every container, apart from the field "FinishedAt", which varies by ±0.05 seconds.
"State": {
  "Status": "exited",
  "Running": false,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 0,
  "ExitCode": 137,
  "Error": "",
  "StartedAt": "2021-11-13T10:33:04.785566471Z",
  "FinishedAt": "2021-11-13T10:33:57.1xxxxZ"
  5. Re-examined my docker-compose.yml (a sketch for validating the resolved configuration follows the file).
$ cat docker-compose.yml
version: "3"
services:
  dev:
    image: jupyter/scipy-notebook:33add21fab64
    environment:
      - COMPONENT=develop
    volumes:
      - /opt/mysupercoolsystem:/home/jovyan/work
      - /media:/media
    ports:
      - "3333:3333"
    entrypoint: sh -c "jupyter lab --ip=0.0.0.0 --port=3333 --no-browser --allow-root"

  loader:
    image: hub.nic.dk/nicecompany/mysupercoolsystem
    working_dir: "/app"
    volumes:
      - /media:/media

  trainer:
    image: hub.nic.dk/nicecompany/mysupercoolsystem
    environment:
      - COMPONENT=train
    working_dir: "/app"
    volumes:
      - models:/models

  predictor:
    image: hub.nic.dk/nicecompany/mysupercoolsystem
    environment:
      - COMPONENT=pred
    working_dir: "/app"
    volumes:
      - models:/models

  api:
    image: hub.nic.dk/nicecompany/mysupercoolsystem
    environment:
      - COMPONENT=api
    working_dir: "/app"
    ports:
      - "69:69"
    entrypoint: python -m mysupercoolsystem.web_api

volumes:
  models:
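One way to double-check what Compose actually resolves from this file (merged volumes, entrypoints, environment) is to print the effective configuration; it also fails loudly on syntax errors. A sketch:

$ docker-compose config              # print the fully resolved configuration (and validate the file)
$ docker-compose config --services   # list just the defined services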
  6. Examined the Dockerfile. Note: services that do not have an explicit entrypoint in docker-compose.yml inherit the ENTRYPOINT from the Dockerfile (a sketch for confirming the effective entrypoint follows the Dockerfile).
$ cat mysupercoolsystem/Dockerfile
FROM python:3.8
WORKDIR /app
COPY ./requirements.txt /app/requirements.txt
RUN pip install -r requirements.txt
COPY . /app
RUN pip install .
ENTRYPOINT python -m mysupercoolsystem
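To confirm which entrypoint and command a container actually ended up with (the Compose-level entrypoint vs. the Dockerfile ENTRYPOINT above), the recorded container config can be inspected. A sketch, using the trainer container as an example:

$ # Effective entrypoint/command as recorded in the container config
$ docker inspect --format '{{json .Config.Entrypoint}} {{json .Config.Cmd}}' mysupercoolsystem_trainer_1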
  7. Checked a similar issue (that issue had the --abort-on-container-exit flag as the culprit; I am not using any flags).

How to proceed

  • Why are the services exiting?
  • What can I do to troubleshoot the error?
  • Are there other logs I should be checking?
  • If I add restart: unless-stopped on each service, is there any way to examine docker service exits apart from my own logging via docker logs?
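For reference, besides docker logs, exit information also surfaces in the Docker daemon's event stream and, on a systemd host such as AlmaLinux, in the daemon's journal. A sketch; the --since values are only examples:

$ # Container lifecycle events, including "die" with the exit code
$ docker events --since 48h --filter event=die
$ # Docker daemon log on a systemd host
$ journalctl -u docker.service --since "2021-11-13"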
DannyDannyDanny

3 Answers


I had a `&& sleep 1h` at the end of my shell script to signal that it is done (running on this node), and a health-cmd of `pidof sleep` configured. It looks like something changed in the Alpine container I'm using which led to sleep no longer running as an extra process, and in the end Docker's health check killed the shell in the container. This also leads to a 137 (SIGKILL) exit code, but it is not triggered by the OOM killer.
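A quick way to check whether this is what is happening might be to look, from the outside, for sleep as a separate process in the running container; if the shell runs sleep as a builtin, pidof finds nothing and a `pidof sleep` health check keeps failing. A sketch; <container> is a placeholder:

$ # Does a separate "sleep" process exist inside the container?
$ docker exec <container> pidof sleep || echo "no external sleep process"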

thomas

You can use https://pythonspeed.com/fil/ to debug out-of-memory errors in Python (see https://pythonspeed.com/articles/crash-out-of-memory/).
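For reference, the basic invocation from the linked Fil documentation is roughly the following, as far as I recall (package and command names should be verified against the site):

$ pip install filprofiler          # install the Fil memory profiler
$ fil-profile run yourscript.py    # run a script under Fil; on an out-of-memory crash it dumps a memory report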

Itamar Turner-Trauring
    The docker containers run on a single virtual (vcenter-managed) host. The host is allocated 20GB of RAM and vcenter monitoring shows RAM usage peaks at 8GB and never more. The logs also suggest that OOM is not the problem: `"OOMKilled": false`. Also the program is not supposed to exit - rather it runs in a while-true-do-sleep loop. – DannyDannyDanny Nov 17 '21 at 14:26
  • It's just that exit code 137 is `kill -9`, I believe, and something that often does this is the Linux out-of-memory killer. – Itamar Turner-Trauring Nov 17 '21 at 15:10

Try adding the following under each affected service in docker-compose.yml:

healthcheck:
  disable: true

Thanks to @thomas, I discovered that my Python container was based on Alpine, which recently suffers from a failing `pidof sleep` healthcheck. This led to the Docker daemon killing the container with a 137 exit code but `"OOMKilled": false`.

Those using docker run directly can add --no-healthcheck.
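To check whether a health check is what is killing a container, the configured check and its recent results can be inspected (.State.Health is only present when a health check is configured; <container> is a placeholder):

$ docker inspect --format '{{json .Config.Healthcheck}}' <container>   # the configured health check, if any
$ docker inspect --format '{{json .State.Health}}' <container>         # status and recent probe results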

BrianTheLion
  • My solution was writing `/usr/bin/sleep` instead of just `sleep` so the shell would be forced to run the external sleep program and not use a shell builtin for it. – thomas Aug 25 '23 at 12:03