Fencing considerations, quorum settings, and commands for sensible behavior in a two-node Pacemaker with Corosync cluster
Goals for a two-node cluster:
- Do not go online with stale data (replication case).
- Do not cause a "startup fencing loop".
- Run Pacemaker "primitive" services exactly once (to prevent IP conflicts, data corruption, and other issues).
To be allowed to start services, a node needs to be "quorate", that is, the node needs to be a member of a cluster partition that has quorum. The node also must be certain that the respective Pacemaker "primitive" service is not (and cannot possibly be) running anywhere else.
Methodology:
Setting `two_node: 1` within the quorum section of a Corosync configuration file enables two-node cluster operations. Enabling this setting automatically enables Corosync's `wait_for_all` quorum option. This is sensible behavior because quorum in a two-node cluster is two nodes (50% of the votes + one). So on startup, a node will always wait for the other node, and only then become ready to provide services.
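For reference, a minimal quorum section in `/etc/corosync/corosync.conf` that enables this behavior might look like the following (the `corosync_votequorum` provider is required for the `two_node` setting):

quorum {
    provider: corosync_votequorum
    two_node: 1
}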
Pacemaker will then start, see both nodes in the membership, will "probe" for current service status, and try to change the "state of the world" using start, stop, or possibly other actions based on configured policy.
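To observe this at runtime, you can query the cluster status, including node membership and the probed resource states, with the `crm_mon` utility in one-shot mode:

crm_mon -1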
Fencing Considerations in a Two-node Cluster
If the two-node cluster now loses a node, the remaining node will continue to run services, or take over services. In a two-node cluster, this presents a special problem, because a single remaining node can never hold a true majority of votes (real quorum, the way a surviving partition can in a cluster of three or more nodes). If the two-node cluster lost a node because of a "communication problem", both nodes are alive, but from either node, the other node will appear to be unresponsive, and so each node needs to "fence" the other node before taking any further action. After a successful "fencing operation", services cannot possibly be running on the fenced node.
In this scenario, typically one of the nodes won the race to fence the other node, and so the other node is rebooting, due to Pacemaker options that you would have configured as part of a typical setup.
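As an illustration of such a setup, on a cluster administered with the `pcs` shell, IPMI-based fence devices for two nodes might be configured as follows. The device names, addresses, and credentials here are hypothetical, and fence agent parameter names vary between fence agent versions:

pcs stonith create fence-node-a fence_ipmilan ip=192.0.2.10 username=admin password=secret pcmk_host_list=node-a
pcs stonith create fence-node-b fence_ipmilan ip=192.0.2.11 username=admin password=secret pcmk_host_list=node-b
pcs property set stonith-enabled=true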
After the reboot, if communication between the nodes is still down, and without the "wait for all" behavior, the node would, perhaps after some timeout, start Pacemaker. However, because communication between the nodes is down, the newly rebooted node would "think" that the other node is unresponsive, and try to "fence" the other node that you, as the omniscient global observer, know is happily running services.
This pattern would repeat with changing roles, each time with the newly rebooted node fencing the other node.
You might consider not using fencing for your two-node cluster. However, without fencing, with replicated data, and with no communication between the nodes, you would get diverging data sets.
Without fencing, and with "shared" data, you would get data corruption.
With proper fencing configured on both the Pacemaker and DRBD® levels, you might get successful STONITH behavior, but then DRBD would still refuse to take over with "only consistent" or "outdated" data on one of the nodes.
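Fencing at the DRBD level means setting a fencing policy and fence-peer handlers in the DRBD resource configuration. As a sketch, for a hypothetical resource `r0`, using the handler scripts shipped with DRBD 9 `drbd-utils` (in DRBD 8.4, the `fencing` keyword lives in the `disk` section instead):

resource r0 {
    net {
        fencing resource-and-stonith;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
        unfence-peer "/usr/lib/drbd/crm-unfence-peer.9.sh";
    }
}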
With the implicit "wait for all", Pacemaker will not start, and so the newly rebooted node will not become quorate until communication with the peer has been reestablished. This avoids the startup-and-fence-the-other-node repeating loop.
The remainder of this article repeats and rephrases the above information, with some subtleties and some additional commands that you may want to use in certain circumstances. If you have understood the startup behavior of a two-node Pacemaker cluster from the article so far, you can stop reading here, or else continue reading to reinforce and deepen your understanding.
To rephrase:
In Pacemaker with Corosync two-node clusters, you should use the `two_node: 1` quorum setting in your Corosync configuration file. Remember from earlier discussion that this Corosync configuration setting effects an implicit "wait for all" behavior for each node.
You should also consider that `no-quorum-policy=stop` is the default setting in Pacemaker if you have not configured it differently. If you have configured `no-quorum-policy=freeze` instead, the behavior described in this article for a two-node cluster will be the same as for `no-quorum-policy=stop`.
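If you want to set or verify this property explicitly, you can do so with your cluster shell of choice, for example with `pcs` (or with `crm configure property no-quorum-policy=stop` when using `crmsh`):

pcs property set no-quorum-policy=stop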
So after startup, and without communication to the other node, the newly rebooted node does NOT have quorum, and without quorum, it will not fence the other node or start anything, or do anything else, really, because of the `stop` (or `freeze`) `no-quorum-policy` setting.
Once communication is reestablished between the two nodes and the newly rebooted node has seen its peer (and so become quorate), you then can lose the peer (and, because of the `two_node` Corosync quorum setting, keep quorum).
Should you ever need to bring up an isolated single node, you can explicitly cancel the initial "wait for all" stage at runtime with the following command:
corosync-cmapctl -s quorum.cancel_wait_for_all u8 1
Of course, before doing this, you should confirm that this is the right thing to do in your specific situation by following some documented administrative best practices procedure that you should have in place around your data or services.
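To inspect the node's current quorum and membership state as part of such a procedure, you can use the `corosync-quorumtool` utility:

corosync-quorumtool -s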
But "properly configured DRBD" may still prevent your node from going online, if your node "suspects" that its peer might have better data. If you, as an omniscient global observer, know better, then you can use the `drbdadm primary --force` command to manually try to have the node go online with outdated or possibly stale data, or even with "inconsistent, but hopefully just by a little bit" data. (An `fsck` command is strongly recommended here!)
To reiterate:
The key point is that in a two-node cluster, as in this case, Corosync (the communication and membership layer of Pacemaker clusters) is configured for `two_node` quorum behavior, which implies the `quorum.wait_for_all` behavior that the `corosync-cmapctl` command above cancels.
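If you want to verify which quorum-related keys Corosync is actually running with, you can list them from the CMAP database (which keys appear, and their names, can vary with the Corosync version):

corosync-cmapctl | grep quorum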
That means that you can shut down one node from a two-node cluster, keep the other node running, and that should work just fine.
However, if you then choose to stop that single node as well, and restart it as an isolated single node, the `wait_for_all` Corosync setting will block the node from starting services, because it will be waiting for its peer.
This behavior is by design, and in general a good thing.
If you actually mean to bring up that single isolated node, and you know it has good data, and you know the other node is down, then you can explicitly cancel the initial "wait for all" stage with this command:
corosync-cmapctl -s quorum.cancel_wait_for_all u8 1
To re-reiterate:
The recommended Corosync quorum setting for a two-node Pacemaker with Corosync cluster is `two_node: 1`, which automatically enables an implicit `wait_for_all: 1` quorum setting.
The consequence is that you have to bring both nodes up, and in communication with each other, before the cluster will start services.
You then can lose either node, so long as the other node keeps running.
If you have to boot a single, isolated node, and you know that this node has the most recent, good data, that the other node is down and will stay down, and you want this node to start services without bringing up the other node, then you can cancel the wait using the `corosync-cmapctl` command above.
If you disable the `wait_for_all` Corosync quorum setting, and set `no-quorum-policy=ignore` in your Pacemaker configuration, and get into a situation where fencing does work, but the cluster communication does not, then you may end up with two nodes repeatedly rebooting and fencing each other.
This is why this is NOT the default setup.
This situation could be mitigated by not starting the cluster software on regular boot. However, that would require operator interaction after every reboot, for any reason.
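On systemd-based distributions, for example, you could keep the cluster stack from starting at boot, and later start it manually after verifying the node's state:

systemctl disable corosync pacemaker
systemctl start corosync pacemaker    # run manually, after verification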
Created by MAT (based on original content by LE) - 2022-07-21
Reviewed by DJV 2022-07-25