
I have a 23 node cluster running CoreOS Stable 681.2.0 on AWS across 4 availability zones. All nodes are running etcd2 and flannel. Of the 23 nodes, 8 are dedicated etcd2 nodes, the rest are specifically designated as etcd2 proxies.
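For reference, each proxy node picks up its etcd2 settings from a systemd drop-in generated by cloud-config. A minimal sketch of what such a drop-in looks like (addresses and member names below are placeholders, not my actual values):

    # /etc/systemd/system/etcd2.service.d/20-cloudinit.conf (sketch; placeholder values)
    [Service]
    # Run this member as a proxy rather than a voting member
    Environment="ETCD_PROXY=on"
    # Where local clients (fleet, application units) reach etcd
    Environment="ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379,http://0.0.0.0:4001"
    # The voting members this proxy forwards requests to (placeholder IPs)
    Environment="ETCD_INITIAL_CLUSTER=etcd-a=http://10.0.1.10:2380,etcd-b=http://10.0.2.10:2380,etcd-c=http://10.0.3.10:2380"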

Scheduled to the cluster are 3 NGINX Plus containers, a private Docker registry, SkyDNS, and 4 of our application containers. The application containers register themselves with etcd2, and the nginx containers pick up any changes, render the necessary configuration files, and finally reload.
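The registration itself follows the usual sidekick pattern; a simplified sketch of what one of those units looks like (unit names, key paths, and the port are illustrative, not our actual files):

    # app-sidekick@.service (illustrative sketch of the sidekick pattern, not our actual unit)
    [Unit]
    Description=Register an app instance in etcd2
    # Tie the sidekick's lifetime to the application unit it announces
    BindsTo=app@%i.service
    After=app@%i.service

    [Service]
    # Publish this instance under a TTL'd key and keep refreshing it while the app runs;
    # the nginx containers watch this prefix, re-render their config, and reload.
    ExecStart=/bin/sh -c 'while true; do etcdctl set /services/app/%i "{\"host\": \"%H\", \"port\": 8080}" --ttl 60; sleep 45; done'
    # Remove the key when the sidekick stops so nginx drops the backend
    ExecStop=/usr/bin/etcdctl rm /services/app/%i

    [X-Fleet]
    # Run the sidekick on the same machine as the app it registers
    MachineOf=app@%i.service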

This all works perfectly, until a single etcd2 node is unavailable for any reason.

If the cluster of voting etcd2 members loses connectivity to even a single voting etcd2 member, all of the services scheduled through fleet become unstable: scheduled units begin stopping and starting without my intervention.

As a test, I began stopping the EC2 instances which host the voting etcd2 nodes until quorum was lost. After the first etcd2 node was stopped, the symptoms described above began. After a second node was stopped, the services remained unstable, with no further observable change. Then, after the third was stopped, quorum was lost and all units were unscheduled. I then started all three etcd2 nodes again, and within 60 seconds the cluster had returned to a stable state.

Subsequent tests yield identical results.

Am I hitting a known bug in etcd2, fleet or CoreOS?

Is there a setting I can modify to keep units scheduled onto a node even if etcd is unavailable for any reason?

  • You have 8 dedicated voting etcd2 nodes (no applications running on these hosts)? 8 is a weird number, but not a problem (5, 7 or 9 would be better). Can you share the configuration for a sample voting etcd2, and an example proxy etcd2? The logs would be very useful (journalctl --system) for the time period when instability happens from a few affected hosts. Can each proxied etcd2 reach *all* voting etcd2? Is it possible that your voting etcd2 are being demoted to proxy during discovery? – Greg Jul 07 '15 at 13:02

1 Answer


I've experienced the same thing. In my case, running 1 specific unit caused everything to blow up: scheduled and perfectly fine running units were suddenly lost without any notice, and machines even dropped out of the cluster.

I'm still not sure what the exact problem was, but I think it might have had something to do with etcd vs. etcd2. I had a dependency on etcd.service in the unit file, which (I think, but I'm not certain) caused CoreOS to try to start etcd.service while etcd2.service was already running. This might have caused the conflict in my case and messed up the etcd registry of units and machines.
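In other words, if a unit declares a dependency like the first form below, CoreOS will try to pull in the old etcd.service even though etcd2.service is the daemon that's actually running; pointing the dependency at etcd2.service is what I switched to (a sketch for comparison, not my exact unit):

    # Problematic form: pulls in the old etcd.service even though etcd2 is running
    [Unit]
    Requires=etcd.service
    After=etcd.service

    # What I switched to: depend on the daemon that is actually running
    [Unit]
    Requires=etcd2.service
    After=etcd2.service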

Something similar might be happening to you, so I suggest you check on each host whether you're running etcd or etcd2, and check your unit files to see which one they depend on.

yp28
  • In fact, when using etcd2, units and their dependencies must not require etcd, or things will go wrong. – ericson.cepeda Sep 22 '15 at 19:24
  • @ericson.cepeda, please explain why it will make things go wrong. In my opinion, it's tricky, but not impossible. etcd2 has backwards-compatibility support, so new nodes running etcd2 can join existing clusters with nodes running etcd (if I'm not mistaken). However, when scheduling units, it may cause conflicts when the unit requires etcd. I think a workaround would be to apply metadata regarding the etcd version to the CoreOS node and schedule the unit on a node that's running the etcd version specified in the unit file (see the sketch after this comment thread). – yp28 Sep 23 '15 at 09:52
  • Yes, nodes can still communicate, but as you said, running units on the same node with different etcd versions cause conflicts, indeed: https://github.com/coreos/docs/issues/528 – ericson.cepeda Sep 24 '15 at 04:41
  • Example for this behaviour: https://github.com/coreos/etcd/issues/3103#issuecomment-119293975 – Tarnschaf Jan 07 '17 at 04:45
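A sketch of the metadata workaround mentioned above (the metadata key and value are made up for illustration): start fleet on each node with metadata describing its etcd version, then pin the unit to matching machines with X-Fleet:

    # Illustrative unit snippet; assumes the fleet daemon on etcd2 nodes was started
    # with metadata "etcd_version=2" (e.g. via the fleet section of the node's cloud-config).
    [Unit]
    Requires=etcd2.service
    After=etcd2.service

    [X-Fleet]
    # Only schedule this unit on machines advertising etcd_version=2
    MachineMetadata=etcd_version=2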