
I want to build a cluster of three Percona XtraDB + application servers in EC2 using Auto Scaling groups, so that if a server fails for any reason it can be shut down, and the ASG will then launch a replacement that pulls all current data from the other two working servers.

To implement this I've created three instances (A, B, and C). On initial startup, instance A tests port 4567 on instances B and C; if that port is open on either of them, XtraDB starts with the proper wsrep_cluster_address settings and fetches an SST from the running instance.

If the port is closed on both instances, A starts with wsrep_cluster_address=gcomm:// and becomes the "origin" of the cluster, assuming instances B and C simply haven't been started yet and will join later.
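A minimal sketch of that startup check, assuming a plain TCP probe of port 4567 (the Galera group-communication port) is what decides between joining and bootstrapping; the helper names are hypothetical:

```python
import socket

GALERA_PORT = 4567  # Galera group-communication port

def port_open(host, port=GALERA_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def cluster_address(peers):
    """Build the wsrep_cluster_address value: join any reachable peers,
    or bootstrap with a bare gcomm:// if none respond."""
    live = [p for p in peers if port_open(p)]
    if live:
        return "gcomm://" + ",".join(live)  # join; SST comes from a donor
    return "gcomm://"                       # bootstrap as the first node
```

On instance A, `cluster_address(["B", "C"])` yields either a join address or the bootstrap value, which is exactly where the risk described below comes from: a transient network failure makes A bootstrap even though B and C are alive.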

The problem is: if instances B and C are running but A can't connect to them at launch, a "split brain" occurs. How do I avoid this situation?

Fluffy

1 Answer


If A cannot talk to B and C when it starts up, A will bootstrap. That isn't really split brain; you end up with two separate clusters: the existing data on B/C, and no data on A.

You probably need service discovery, something like Consul or etcd, to act as the "source of truth" for the status of your cluster in the automated fashion you're trying to achieve. On startup, each node contacts Consul and looks for a key representing any existing nodes; if there are none, it bootstraps and then registers itself with the discovery service. Each node, once online, should regularly refresh its registration to say "I'm still here."
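The flow above could be sketched against Consul's KV HTTP API; the key prefix, addresses, and timeout values here are assumptions, and the join/bootstrap decision is factored out from the HTTP calls:

```python
import json
import urllib.error
import urllib.request

CONSUL = "http://127.0.0.1:8500"   # assumed: local Consul agent
PREFIX = "pxc/nodes/"              # hypothetical key prefix for cluster members

def decide(nodes):
    """If discovery knows of no nodes, bootstrap; otherwise join them."""
    if not nodes:
        return "gcomm://"                          # first node: bootstrap
    return "gcomm://" + ",".join(sorted(nodes))    # join known members

def registered_nodes():
    """List node addresses currently registered under PREFIX."""
    url = f"{CONSUL}/v1/kv/{PREFIX}?keys"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            keys = json.load(resp)
    except urllib.error.URLError:
        # Discovery unreachable. In production, refuse to bootstrap here
        # rather than assume you're the first node.
        return []
    return [k[len(PREFIX):] for k in keys]

def register(my_addr):
    """Write our key. In production, attach a Consul session with a TTL
    and refresh it periodically, so the key vanishes if we stop saying
    "I'm still here"."""
    req = urllib.request.Request(
        f"{CONSUL}/v1/kv/{PREFIX}{my_addr}", data=b"up", method="PUT")
    urllib.request.urlopen(req, timeout=2)
```

Keeping `decide()` separate from the network calls means the bootstrap-or-join logic consults the discovery service's view of the cluster rather than a node's own (possibly partitioned) view of its peers, which is what prevents the accidental re-bootstrap described in the question.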

The real problem occurs when all nodes go down and the ASG has to rebuild all of them. Where does the data come from in that case? There wouldn't be any. This is one of the biggest downsides of fully automated configurations like this. You would be better off with proper monitoring that alerts you when a node goes offline, so you can take smarter action yourself.

utdrmac