
I have a high-availability cluster configured with a DRBD resource.

 Master/Slave Set: RVClone01 [RV_data01]
     Masters: [ rvpcmk01-cr ]
     Slaves: [ rvpcmk02-cr ]

I performed a test in which I disconnected one of the network adapters carrying the DRBD replication traffic (for example, by shutting the adapter down). The cluster still reports that everything is OK, BUT "drbd-overview" shows the following on the primary server:

[root@rvpcmk01 ~]# drbd-overview
 0:drbd0/0  WFConnection Primary/Unknown UpToDate/DUnknown /opt ext4 30G 13G 16G 45%

and on the secondary server:

[root@rvpcmk02 ~]# drbd-overview
 0:drbd0/0  StandAlone Secondary/Unknown UpToDate/DUnknown

Now I have a few questions:

1. Why doesn't the cluster know about the problem with DRBD?

2. When I brought the network adapter back up and restored the connection between the DRBD nodes, why didn't DRBD handle the failure and resync once the connection was OK again?

3. I saw an article about "Solve a DRBD split-brain" - https://www.hastexo.com/resources/hints-and-kinks/solve-drbd-split-brain-4-steps/ - which explains how to recover from a disconnection and resync DRBD. BUT how would I know that this kind of problem exists in the first place?

I hope I have explained my case clearly and provided enough information about what I have and what I need...

Lidor Aviman

1 Answer


1) You aren't using fencing/STONITH devices in Pacemaker or DRBD, which is why nothing happens when you unplug your network interface that DRBD is using. This isn't a scenario that Pacemaker will react to without defining fencing policies within DRBD, and STONITH devices within Pacemaker.
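As an illustration only (the device name, address, and credentials below are placeholders, and `fence_ipmilan` is just one of many fence agents; pick whichever matches your hardware), registering a STONITH device with `pcs` might look roughly like this:

```shell
# Sketch: register an IPMI-based STONITH device in Pacemaker.
# Every value here (name, host, address, credentials) is a placeholder
# that must be replaced with details from your own environment.
pcs stonith create fence_rvpcmk01 fence_ipmilan \
    pcmk_host_list="rvpcmk01-cr" ipaddr="192.168.1.101" \
    login="admin" passwd="secret" lanplus=1 \
    op monitor interval=60s

# Verify the device started and stays running before relying on it.
pcs stonith show
```

The important part is that Pacemaker has *some* working fence agent it can use to power off a misbehaving node; without that, it has no safe reaction to the scenario you tested.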

2) You likely are only using one ring for the Corosync communications (the same as the DRBD device), which will cause the Secondary to promote to Primary (introducing a split-brain in DRBD), until the cluster communications are reconnected and realize they have two masters, demoting one to Secondary. Again, fencing/STONITH would prevent/handle this.
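For example, a second Corosync ring on a separate network would be declared in `corosync.conf` roughly like this (a sketch for multicast transport; the network addresses are placeholders):

```
totem {
    version: 2
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0   # primary cluster network (placeholder)
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.0.0.0      # separate backup network (placeholder)
        mcastport: 5407
    }
}
```

With two rings, losing the DRBD link alone no longer partitions the cluster communications, so the Secondary is not promoted just because one NIC went down.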

3) You can set up the split-brain notification handler in your DRBD configuration.

Once you have STONITH/fencing devices set up in Pacemaker, you would add the following definitions to your DRBD configuration to "fix" all the issues you mentioned in your question:

resource <resource> {
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    ...
  }
  disk {
    fencing resource-and-stonith;
    ...
  }
  ...
}
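For completeness, here is a sketch of the manual recovery procedure the hastexo article describes, should a split-brain occur anyway (run these against your actual resource name, and only after deciding which node's data to discard):

```shell
# On the node whose changes you are willing to discard (the "victim"):
drbdadm disconnect <resource>
drbdadm secondary <resource>
drbdadm connect --discard-my-data <resource>

# On the surviving node (only needed if it also shows StandAlone):
drbdadm connect <resource>
```

With the `notify-split-brain.sh` handler from the config above in place, root gets an email when this situation arises, which answers your question 3 about how you would know the problem exists.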

Setting up fencing/STONITH in Pacemaker is a little too dependent on your hardware/software for me to give you pointers on setting that up for your cluster. This should get you pointed in the right direction: http://clusterlabs.org/doc/crm_fencing.html

Hope that helps!

Matt Kereczman
  • It helps a lot. I tried to follow what you said and read about fencing/STONITH, but I still don't understand something: does setting up fencing/STONITH in Pacemaker depend on my physical servers? Do my servers need to support fencing? I've seen the term "fencing device" in a few places; what does it mean? – Lidor Aviman Apr 02 '17 at 08:11
  • A fencing device (specifically a STONITH device in this case) is something that can take a server in an unknown state and put it into a known state (in this case, that state is powered off), therefore allowing the remaining cluster nodes to safely continue performing their start/stop/monitor tasks. This can be done with out-of-band remote management interfaces like DRAC, ILO, IPMI, etc. Or, you can use something like a smart PDU/UPS to cut power on a specific port/outlet. You should really read the clusterlabs.org link I provided in my answer; you'll see there are many ways to implement it. – Matt Kereczman Apr 03 '17 at 17:33
  • I read what you sent and also implemented a stonith/fencing resource using this guide: https://www.lisenet.com/2015/active-passive-cluster-with-pacemaker-corosync-and-drbd-on-centos-7-part-4/#comment-1180 – Lidor Aviman Apr 04 '17 at 11:49
  • After I configured the stonith/fencing resource it started properly, but after a few minutes it stopped. Please advise what could be wrong: my_vcentre-fence (stonith:fence_vmware_soap): Stopped – Lidor Aviman Apr 04 '17 at 11:56
  • Did you try running the `fence_vmware_soap` command manually to list the VMs vCenter is running as the guide you linked to suggests doing? If you didn't, I would highly recommend that as it should give you some more meaningful errors. If you can't get it to work manually, Pacemaker won't be able to either. – Matt Kereczman Apr 04 '17 at 19:31
  • I did it and it looks OK, except for a warning message about the certificate: `[root@rvpcmk01 ~]# fence_vmware_soap --ip 172.17.235.96 --ssl --ssl-insecure --action list --username="administrator@vSphere.local" --password="Radview1@" | grep rvpcmk` The output of the command is in the next comment... – Lidor Aviman Apr 05 '17 at 06:49
  • `/usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html InsecureRequestWarning) CentOS7-Minimal-rvpcmk01,564dc8c1-a1e9-0151-a5ce-839216def6be rvpcmk02,564d3cd7-5a7f-35cc-da79-d5dc90f485c7 CentOS7-Minimal-rvpcmk02,564d0b46-a8c8-985f-bc6b-d4e135c06a87` Maybe the missing certificate is causing my problems? – Lidor Aviman Apr 05 '17 at 06:49
  • So finally I succeeded in configuring my stonith resource properly, but even after I added the lines to the DRBD configuration the behavior isn't good enough: when I shut down the DRBD interface all my resources stop, and when I bring the DRBD interface back up it doesn't resync or return to the previous behavior. – Lidor Aviman Apr 06 '17 at 13:34
  • Your `no-quorum-policy` is likely still set to `stop`, so when you fence the peer, your surviving node is stopping resources because it doesn't have quorum... You need to be asking more concise questions instead of asking, "how to setup an entire cluster with all the best practices". You should check out some guides from reputable sources on clustering like Clusterlabs.org or linbit.com. Then, when you have a more specific question, come back here and ask your questions. – Matt Kereczman Apr 06 '17 at 15:00
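The `no-quorum-policy` adjustment mentioned in the last comment can be sketched as follows, assuming a plain two-node cluster where the survivor can never hold quorum after fencing its peer:

```shell
# In a two-node cluster the surviving node loses quorum whenever its
# peer is fenced, so tell Pacemaker to keep running resources anyway.
# (Only appropriate for two-node clusters with working fencing.)
pcs property set no-quorum-policy=ignore

# Confirm the current setting.
pcs property show no-quorum-policy
```
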