
I have a Flask application that allows users to start long-running tasks (sometimes > 1 day) via a Celery job queue. The Flask application and all its dependencies, including the Celery workers, are containerized with Docker and started from a docker-compose file.

My problem is that when I update the container images with a new version of the application software I need to restart the containers with:

docker-compose down
docker-compose up -d

This cancels all long-running jobs, because docker-compose only waits for a short timeout by default before killing the containers. Setting a longer timeout for a graceful stop, as suggested in docker-compose and graceful Celery shutdown, does not work for me: there is no way to predict how long the jobs will take, so the update might block for a very long time until all tasks are finished.
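For reference, that suggestion boils down to raising the stop timeout, either per service in the compose file or on the command line (the 24h value is an arbitrary example):

# docker-compose.yml, per service:
#     stop_grace_period: 24h
# or when taking the stack down:
docker-compose down --timeout 86400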

My idea was to somehow detach the running container from docker-compose's control and then issue a graceful shutdown of Celery inside the detached container, which lets the running jobs finish but accepts no new jobs. Then I could start the normal container stack via docker-compose up -d.

Thus I would like to do:

  • remove/rename the Celery container so that docker-compose no longer controls it
  • signal the Celery worker in the container to stop gracefully, letting running jobs finish but accepting no new ones
  • then start the new containers that will accept new jobs (a rough sketch of this sequence follows)
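A rough sketch of that sequence (container and service names are made up):

# detach the old worker from compose -- this is the part that fails, see below
docker rename myproject_celery_worker_1 celery_worker_draining
# warm shutdown: finish running jobs, accept no new ones
docker kill --signal=SIGTERM celery_worker_draining
# bring up the updated stack
docker-compose up -d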

I tried to use docker rename on the containers started by docker-compose, but they still react to docker-compose down.

My question is whether this approach is the right way to handle this, and whether it is even possible with docker-compose. What would be the best practice for graceful updates of Celery workers with long-running tasks in a docker-compose environment?

Other questions that I found that are related but do not solve the problem entirely:

docker-compose and graceful Celery shutdown: the answer shows how to stop the containers gracefully, but I want to start a new Celery worker immediately so that there is no downtime.

How do I restart celery workers gracefully?: this works for a local installation, but I have to restart the containers to get the new application code.

EDIT: New hints towards a solution:

In this issue I found a similar situation. There, docker-compose --scale is used to duplicate a service; one can then find the IDs of the old and the new service. Once the new service is up, one should be able to tell Celery in the old container to shut down and finish its executing tasks. If this turns out to be the solution I will add it as an answer later.

https://github.com/docker/compose/issues/1786#
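Adapted to my worker service, the command from that issue would look roughly like this:

docker-compose up -d --no-deps --no-recreate --scale karmada_celery_kalibrate_worker=2 karmada_celery_kalibrate_worker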

EDIT: Thinking more about the scaling variant: here I run into the long-running tasks problem again. It would be cumbersome to watch the dying container until I can scale back to 1 instance. In the linked example it was only important to check that the new service is really up before stopping the old one, so the script could scale back to a single instance immediately. I would rather duplicate the service but remove the new service from docker-compose's control, so that it won't get killed when I scale back to 1 container. This should be possible by removing the docker-compose labels of the running container.
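These labels show up in the output of docker inspect (the container name here is a placeholder for the name compose generated):

docker inspect karmada_docker_upgreat_karmada_celery_kalibrate_worker_1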

"Labels": {
                "com.docker.compose.config-hash": "44e0bbd2a10e28bcad071a42315e65ed4d89f2d815a08aed4f3133b05b9d9f71",
                "com.docker.compose.container-number": "1",
                "com.docker.compose.oneoff": "False",
                "com.docker.compose.project": "karmada_docker_upgreat",
                "com.docker.compose.project.config_files": "docker-compose_test.yml",
                "com.docker.compose.project.working_dir": "/home/USERNAME/git/karmada_docker_upgreat",
                "com.docker.compose.service": "karmada_celery_kalibrate_worker",
                "com.docker.compose.version": "1.25.0"
            }

Or is this the wrong track? Renaming the service makes no difference to docker-compose.

EDIT: Labels cannot be changed for a running container: https://github.com/moby/moby/issues/15496. The more I think about this, the more I believe I will have to use plain docker commands to run the Celery containers. With docker commands and a shell script it would be easy to achieve what I need. I would still like to see a solution within docker-compose.

Ascurion
  • Check [this issue](https://github.com/docker/compose/issues/1294) – Xaqron Apr 22 '20 at 06:24
  • Thanks for the suggestion, but unless I overlooked something, selectively starting/stopping services by name does not help me, because I basically have to start an updated clone of the running celery service and allow the running container to finish what it is doing. Or am I seeing this wrong? – Ascurion Apr 22 '20 at 06:54
  • I'm not familiar with celery. Is it listening on a port to do its job? If yes, then for running more than one instance you need some kind of load balancer (docker has many good networking ideas, so there is a chance it provides one out of the box). Despite this issue your approach seems reasonable. – Xaqron Apr 22 '20 at 07:08
  • celery connects to a broker, `redis` in my case, to receive jobs; the running tasks talk to a database to update status information. I can actually start multiple services with the `--scale` option of docker-compose. My problem is scaling back to a single instance after the original container has died. However, since the jobs can be long-running, it is likely that a second update comes in, so that I would have 3 or more parallel instances running. That is why I would like to detach the container and scale back to 1 immediately... – Ascurion Apr 22 '20 at 07:16
  • As a general solution you can implement a [Round-robin](https://en.wikipedia.org/wiki/Round-robin_scheduling) to shift between multiple containers (say 2, rotating every 1-2 hrs or whatever is appropriate). I'm afraid you cannot do all of that with docker-compose and need some shell script to take care of that single container manually. Just make sure its dependencies are running on its first run. – Xaqron Apr 22 '20 at 07:25
  • I think that this is not the solution for this problem. I only need to let one container finish what it is doing when an update for the container software is due. There is no shortage of resources. Also I can't say how long the jobs may take. They could even take a day to finish. – Ascurion Apr 22 '20 at 20:13

1 Answer


After much more research I found a solution to this problem, but I had to drop the constraint of using docker-compose.

Currently, I think that what I need to do is not possible with docker-compose: a container started with docker-compose will always be controlled by docker-compose commands as long as it is online, because labels cannot be changed on running containers and docker-compose finds the containers it controls via these labels (see the question for details).
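This is easy to check: docker-compose selects its containers purely by these labels, regardless of the container name. Something like the following (project and service name taken from the question) lists exactly the containers compose would act on:

docker ps --filter "label=com.docker.compose.project=karmada_docker_upgreat" \
          --filter "label=com.docker.compose.service=karmada_celery_kalibrate_worker"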

So although one could use:

docker-compose up -d --no-deps --scale $SERVICE_NAME=2 --no-recreate $SERVICE_NAME

to start an updated container and leave the current one running, as suggested here:

https://github.com/docker/compose/issues/1786#

I have no means to scale the services back down after a long-running job finishes. Because the jobs might run very long (> 1 day), I could have multiple containers finishing up at the same time. Thus I would have to implement considerable overhead to count the containers that are currently finishing up and re-scale to the appropriate number whenever one of them is done, always with the danger that an accidental docker-compose down would take them all down.
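Just to illustrate the bookkeeping this would require (a rough sketch, variable names as in the script below): after every draining container exits, something would have to shrink compose's desired count to the number of containers actually left running:

# block until one draining container has finished its tasks
docker wait $OLD_CONTAINER_ID
# then re-scale to whatever is actually still running
RUNNING=$(docker ps -q --filter "label=com.docker.compose.service=$SERVICE_NAME" | wc -l)
docker-compose up -d --no-deps --no-recreate --scale $SERVICE_NAME=$RUNNING $SERVICE_NAME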

But the shell script towards the end of https://github.com/docker/compose/issues/1786# motivated me to drop the docker-compose constraint and control all Celery containers with plain docker commands. With these it is easy to do what I wanted. I came up with the following shell script:

#!/bin/bash

startup () {
  SERVICE_NAME=${1?"Usage: docker_update <SERVICE_NAME> <COMMAND>"}
  COMMAND=${2?"Usage: docker_update <SERVICE_NAME> <COMMAND>"}
  # COMMAND is deliberately unquoted so that it word-splits into the
  # celery executable and its arguments.
  docker run \
         -d \
         --name "$SERVICE_NAME" \
         SOME_DOCKER_IMAGE \
         $COMMAND
}

update () {
  SERVICE_NAME=${1?"Usage: docker_update <SERVICE_NAME> <COMMAND>"}
  COMMAND=${2?"Usage: docker_update <SERVICE_NAME> <COMMAND>"}
  echo "[INFO] Updating docker service $SERVICE_NAME"
  # docker ps lists newest containers first, so `tail -n 1` picks the
  # oldest container matching the service name.
  OLD_CONTAINER_LINE=$(docker ps --format "table {{.ID}}  {{.Names}}  {{.CreatedAt}}" | grep "$SERVICE_NAME" | tail -n 1)
  OLD_CONTAINER_ID=$(echo "$OLD_CONTAINER_LINE" | awk -F "  " '{print $1}')
  OLD_CONTAINER_NAME=$(echo "$OLD_CONTAINER_LINE" | awk -F "  " '{print $2}')

  # Rename the old container out of the way so the new one can take its name.
  TEMP_UUID=$(uuidgen)
  TEMP_CONTAINER_NAME="celery_worker_${TEMP_UUID}"

  echo "[INFO] rename $OLD_CONTAINER_NAME to $TEMP_CONTAINER_NAME"
  docker rename "$OLD_CONTAINER_NAME" "$TEMP_CONTAINER_NAME"

  echo "[INFO] start new/updated celery queue"
  startup "$SERVICE_NAME" "$COMMAND"

  # SIGTERM tells celery to do a warm shutdown: finish the running
  # tasks, accept no new ones, then exit.
  echo "[INFO] send SIGTERM to $TEMP_CONTAINER_NAME for warm shutdown"
  docker kill --signal=SIGTERM "$TEMP_CONTAINER_NAME"

  # Optional: block until the old container has finished all its tasks.
  echo "[INFO] waiting for old docker container to finish"
  docker wait "$TEMP_CONTAINER_NAME"
}

SERVICE_NAME=${1?"Usage: docker_update <SERVICE_NAME> <COMMAND>"}
COMMAND=${2?"Usage: docker_update <SERVICE_NAME> <COMMAND>"}
echo "[INFO] checking if this service already runs"
docker ps --format "table {{.ID}}  {{.Names}}  {{.CreatedAt}}" | grep "$SERVICE_NAME"

if [ $? -eq 0 ]
then
  echo "[INFO] CONTAINER with name $SERVICE_NAME is online -> update"
  update "$SERVICE_NAME" "$COMMAND"
else
  echo "[INFO] CONTAINER with name $SERVICE_NAME is **not** online -> starting"
  startup "$SERVICE_NAME" "$COMMAND"
fi

The script checks whether a service with the given name is already running. If it is not, it starts it. If it is running, it renames the currently running container, starts a new (possibly updated) one, and sends a SIGTERM to the old one. For Celery this signal triggers a warm shutdown: the worker stops accepting new tasks, finishes the tasks it is currently executing, and then exits. If no task is running it exits immediately. The new Celery worker takes over all new tasks.
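For completeness, a call could look like this (the image in the script and the Celery app name here are placeholders):

./docker_update.sh karmada_celery_kalibrate_worker "celery -A karmada worker --loglevel=INFO"

Note that the SIGTERM only triggers the warm shutdown if it actually reaches the Celery process, i.e. Celery should run as PID 1 in the container (exec-form CMD) or the entrypoint has to forward signals.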

Ascurion