
I have a small Mesos cluster and I'm using Marathon to manage a set of long-running services with a variable number of instances each.

I'd like to be able to launch new nodes or terminate some of them as business needs require. However, terminating a node raises a potential problem: when I shut down a Mesos slave, the number of instances of some services can temporarily fall below the defined minimumHealthCapacity. That can cause downtime if, for example, the machine being stopped runs a service that has only one instance.

Consider the following simplified scenario: node 1 is running service A, node 2 is running service B and node 3 is running service C. The minimumHealthCapacity for all services is 1. I want to terminate node 1 and leave only nodes 2 and 3 running, without any downtime on service A. An example of the intended behavior would be to scale service A to 2 instances and then safely terminate node 1.

What can I do to make sure no service falls below the minimumHealthCapacity?

Ideally, I would have a process inspired by rolling updates: replacements are launched on other machines, and only then are the services on the machine to be shut down terminated. At a minimum, I would like this to be automated, so that scaling down is a simple script away. I have no requirement on how long it takes, i.e. I can shut down the Mesos slave only after I'm sure the Marathon migration has finished successfully.

Rui Gonçalves

1 Answer


The Mesos dev team is currently working on "Maintenance Primitives" so that an operator can indicate that a particular machine is scheduled to go down at a certain time (or ASAP), triggering messages to each framework notifying them of the intended unavailability window. A framework like Marathon could then decide to migrate its tasks away from that node so that it can be safely terminated without any service downtime.

See https://issues.apache.org/jira/browse/MESOS-1474 for more details/patches.

Adam
  • In the meantime, your best option is to scale up the instance count for your app(s) by 1, wait for the new instance to be healthy, kill the node, then scale the app(s) back down by 1. – Adam Aug 04 '15 at 03:35
  • Thank you for the response, I'll do that and follow the updates on Jira! Any idea about how long it will be until this feature is implemented? – Rui Gonçalves Aug 05 '15 at 12:42
  • The feature is actively in progress, but will probably take another month or two before everything lands. Keep an eye on that JIRA ticket for more info, and maybe volunteer for one of the subtasks if you'd like to see it accelerated. ;) – Adam Aug 19 '15 at 08:07
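
The workaround from the first comment (scale up by 1, wait for health, stop the node, scale back down) could be scripted roughly as follows. This is only a minimal sketch against Marathon's REST API (`GET /v2/tasks`, `GET` and `PUT /v2/apps/{appId}`). The Marathon URL and the helper names are hypothetical, and it assumes each app defines health checks so that `tasksHealthy` is meaningful. It does not stop the extra instance from being placed on the node you are about to remove; a placement constraint would be needed for that.

```python
import json
import time
import urllib.request

# Assumption: placeholder address, replace with your own Marathon endpoint.
MARATHON = "http://marathon.example.com:8080"


def marathon_request(method, path, body=None):
    """Send a JSON request to Marathon and return the decoded response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        MARATHON + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode() or "{}")


def apps_on_host(host, request=marathon_request):
    """Return {app_id: task_count} for every app with tasks on `host`."""
    counts = {}
    for task in request("GET", "/v2/tasks")["tasks"]:
        if task["host"] == host:
            counts[task["appId"]] = counts.get(task["appId"], 0) + 1
    return counts


def scale_by(app_id, delta, request=marathon_request):
    """Change the instance count of `app_id` by `delta`; return the new target."""
    app = request("GET", "/v2/apps" + app_id)["app"]
    target = app["instances"] + delta
    request("PUT", "/v2/apps" + app_id, {"instances": target})
    return target


def wait_until_healthy(app_id, target, request=marathon_request,
                       timeout=600, poll=5, sleep=time.sleep):
    """Poll until at least `target` tasks report healthy, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        app = request("GET", "/v2/apps" + app_id)["app"]
        if app.get("tasksHealthy", 0) >= target:
            return True
        sleep(poll)
    return False


def prepare_drain(host, request=marathon_request, sleep=time.sleep):
    """Scale up every app running on `host`; return {app_id: extra_instances}."""
    added = apps_on_host(host, request=request)
    for app_id, extra in added.items():
        target = scale_by(app_id, extra, request=request)
        if not wait_until_healthy(app_id, target, request=request, sleep=sleep):
            raise RuntimeError("%s did not become healthy in time" % app_id)
    return added


def finish_drain(added, request=marathon_request):
    """After the node is gone, return each app to its original instance count."""
    for app_id, extra in added.items():
        scale_by(app_id, -extra, request=request)
```

You would call `prepare_drain("node1")`, shut down the slave once it returns, and then pass its result to `finish_drain`. Marathon may reject a `PUT` while another deployment of the same app is in progress, so a real script would need to handle that error and retry.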