I have a 23-node cluster running CoreOS Stable 681.2.0 on AWS, spread across 4 availability zones. All nodes run etcd2 and flannel. Of the 23 nodes, 8 are dedicated etcd2 voting members; the rest are configured as etcd2 proxies.
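For context, the proxies simply forward to the voting members rather than participating in Raft. Written out as plain environment variables for brevity (in practice this lives in cloud-config, and the member names and IPs below are placeholders, not our actual values), the proxy configuration is roughly:

```
# Illustrative etcd2 proxy configuration; member names and IPs are placeholders.
export ETCD_PROXY=on
export ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379,http://0.0.0.0:4001"
export ETCD_INITIAL_CLUSTER="etcd-a=http://10.0.1.10:2380,etcd-b=http://10.0.2.10:2380"
exec etcd2
```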
Scheduled onto the cluster are 3 NGINX Plus containers, a private Docker registry, SkyDNS, and 4 of our application containers. The application containers register themselves with etcd2, and the NGINX containers pick up any changes, render the necessary config files, and reload.
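In case it matters, the registration and reload loop works roughly like this (a sketch, not our exact scripts; the key path, TTL, and helper script name are illustrative):

```
# Inside each application container: advertise ourselves under a key with a TTL.
# Key layout, TTL, and the COREOS_PRIVATE_IPV4 variable are illustrative.
while true; do
  etcdctl set "/services/app/${HOSTNAME}" \
    "{\"host\":\"${COREOS_PRIVATE_IPV4}\",\"port\":8080}" --ttl 60
  sleep 45
done

# On the NGINX hosts: re-render the upstream config and reload on any change.
# render-and-reload.sh is a hypothetical name for our templating/reload step.
etcdctl exec-watch --recursive /services/app -- /usr/local/bin/render-and-reload.sh
```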
This all works perfectly until a single etcd2 node becomes unavailable for any reason.
If the cluster of voting etcd2 members loses connectivity to even a single other voting member, all of the services scheduled through fleet become unstable: scheduled services begin stopping and starting without my intervention.
As a test, I began stopping the EC2 instances that host voting etcd2 nodes until quorum was lost. After the first etcd2 node was stopped, the symptoms described above began. Stopping a second node produced no observable change; services remained unstable. After the third node was stopped, quorum was lost and all units were unscheduled. I then started all three etcd2 nodes again, and within 60 seconds the cluster had returned to a stable state.
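For reproducibility, the test amounted to the following (the instance ID is a placeholder):

```
# Stop one voting member at a time and observe the cluster between stops.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Run from a proxy node after each stop:
etcdctl cluster-health   # shows which members are unreachable
fleetctl list-units      # shows units flapping or being unscheduled
```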
Subsequent tests yield identical results.
Am I hitting a known bug in etcd2, fleet, or CoreOS?
Is there a setting I can modify to keep units scheduled onto a node even if etcd is unavailable for any reason?