
We run a bare-metal Docker Swarm cluster with a large number of containers.

Recently, one of the physical servers came to a full stop.

The main problem occurred when Docker started back up: all the containers tried to start at the same time.

Is there a way to limit how many containers start at once? Or is there another way to avoid overloading the physical server?

zerocewl
Daniel Asanome
  • [Similar question](https://stackoverflow.com/questions/31746182/docker-compose-wait-for-container-x-before-starting-y) You can specify startup order: https://docs.docker.com/compose/startup-order/ – brokenfoot Dec 19 '18 at 19:37
  • @brokenfoot "The depends_on option is ignored when deploying a stack in swarm mode with a version 3 Compose file" https://docs.docker.com/compose/compose-file/#depends_on – BMitch Dec 20 '18 at 11:54
  • Ideally, when the node restarts, containers will have migrated to other nodes and nothing is configured to restart on the selected node. Are your manager nodes down or services constrained to a single node? – BMitch Dec 20 '18 at 11:56
  • @brokenfoot I'm using swarm mode, so sadly I can't use "depends_on", but I will try to write some kind of "wait_for.sh". – Daniel Asanome Dec 20 '18 at 12:55
  • @BMitch Normally the containers migrate to another node, but in this case all the machines were down. After the restart, all containers tried to start at the same time, causing a huge load and delaying the startup of every container. – Daniel Asanome Dec 20 '18 at 12:58

1 Answer


At present, I'm not aware of any way to limit how fast swarm mode starts containers. There is a TODO entry in the code to add an exponential backoff, and there are various open issues in swarmkit (e.g. 1201) that may eventually help with this scenario. Ideally, you would have an HA cluster with nodes spread across different AZs, so that when one node fails, the workload migrates to other nodes and you do not end up with a single overloaded node.

What you can use are resource constraints. You can configure each service with a minimum CPU and memory reservation. This would prevent swarm mode from scheduling more containers on a node than it could handle during a significant outage. The downside is that some services may go unscheduled during an outage and you cannot prioritize which are more important to schedule.
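
As a rough sketch of what such a reservation could look like in a version 3 Compose file deployed with `docker stack deploy`: the service name `web`, the image, and all the numbers below are placeholders chosen for illustration, not values from the question.

```yaml
version: "3.7"
services:
  web:                       # hypothetical service name
    image: nginx:alpine      # placeholder image
    deploy:
      replicas: 4
      resources:
        reservations:        # minimum the scheduler must find free on a node
          cpus: "0.50"
          memory: 256M
        limits:              # optional hard cap per container
          cpus: "1.0"
          memory: 512M
```

With values like these, the scheduler should only place a task on a node that still has 0.5 CPU and 256 MB unreserved, so a node with 8 CPUs would accept at most 16 such tasks and the rest would stay pending. The same reservations can be set on an existing service with the `--reserve-cpu` and `--reserve-memory` flags of `docker service update`.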

BMitch