Reading Pacemaker Logs

Pacemaker is a suite of software not just a single component. Each component records to the cluster logs. Here is a brief description of each.

corosync

Corosync is responsible for communication between the nodes. Corosync will log when connections fail between host, and when cluster nodes join or leave the cluster.

Oct 15 00:52:26 [22107] fc-ha1 corosync error [TOTEM ] Marking ringid 0 interface 192.168.32.142 FAULTY
Oct 15 00:52:27 [22107] fc-ha1 corosync notice [TOTEM ] Automatically recovered ring 0
Oct 09 18:23:00 [22107] fc-ha1 corosync notice [TOTEM ] A new membership (192.168.32.130:24) was formed. Members joined: 2

cib

The cib is the cluster information base. This can somewhat be thought of as the cluster configuration, and it is effectively what you edit when you run 'crm configure edit'. However, the cib is much more just a configuration as it also stores the present state of the cluster as well. The cib is xml in it's raw format, but the crmsh and it's crm commands help make this more human readable.

When the cib logs it's usually done to show a quick diff of the changes to the cib. Below is an example of the cluster updating a node_state to record that node fc-ha1 is online.

Oct 09 17:38:18 [17976] fc-ha1 cib: info: cib_file_backup: Archived previous version as /var/lib/pacemaker/cib/cib-4.raw
Oct 09 17:38:18 [17976] fc-ha1 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='fc-ha1']/lrm:
OK (rc=0, origin=fc-ha1/crmd/24, version=0.4.0)
Oct 09 17:38:18 [17976] fc-ha1 cib: info: cib_perform_op: Diff: --- 0.4.0 2
Oct 09 17:38:18 [17976] fc-ha1 cib: info: cib_perform_op: Diff: +++ 0.4.1 (null)
Oct 09 17:38:18 [17976] fc-ha1 cib: info: cib_perform_op: + /cib: @num_updates=1
Oct 09 17:38:18 [17976] fc-ha1 cib: info: cib_perform_op: ++ /cib/status: <node_state id="1" uname="fc-ha1" in_ccm="true" crmd="online" crm-deb
ug-origin="do_lrm_query_internal" join="member" expected="member"/>
Oct 09 17:38:18 [17976] fc-ha1 cib: info: cib_perform_op: ++ <lrm id="1">
Oct 09 17:38:18 [17976] fc-ha1 cib: info: cib_perform_op: ++ <lrm_resources/>
Oct 09 17:38:18 [17976] fc-ha1 cib: info: cib_perform_op: ++ </lrm>
Oct 09 17:38:18 [17976] fc-ha1 cib: info: cib_perform_op: ++ </node_state>
Oct 09 17:38:18 [17976] fc-ha1 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=fc-ha1/crm
d/25, version=0.4.1)

These logs aren't considered the most useful when it comes to troubleshooting a cluster, but they still can be useful to pin-point exactly when a cluster was put into standby or maintenance-mode.

pengine

The pengine is the Policy Engine. It is this components job to parse the cib and try to bring the cluster to the desired state. The pengine is event driven and only ran where there is a cluster event, or every 15 minutes (by default).

These logs will show the the current cluster state and the desired state. In the example below the current state is all resources are started on node fc-ha1 (common_print, clone_print, and short_print), and the desired state is simply to leave things running on fc-ha1 (LogActions).

Oct 09 18:34:26 [22127] fc-ha1 pengine: info: group_print: Resource Group: g_em7
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: common_print: p_fs_drbd1 (ocf::heartbeat:Filesystem): Started fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: common_print: mysql (ocf::sciencelogic:mysql-systemd): Started fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: common_print: postfix (systemd:postfix): Started fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: common_print: virtual_ip (ocf::heartbeat:IPaddr2): Started fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: common_print: drbdproxy (lsb:drbdproxy): Started fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: clone_print: Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: short_print: Masters: [ fc-ha1 ]
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: clone_print: Master/Slave Set: ms_drbd_r0-L [p_drbd_r0-L]
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: short_print: Masters: [ fc-ha1 ]
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: short_print: Slaves: [ fc-ha2 ]
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: master_color: Promoting p_drbd_r0-L:0 (Master fc-ha1)
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: master_color: ms_drbd_r0-L: Promoted 1 instances of a possible 1 to master
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: master_color: Promoting p_drbd_r0:0 (Master fc-ha1)
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: master_color: ms_drbd_r0: Promoted 1 instances of a possible 1 to master
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: RecurringOp: Start recurring monitor (20s) for p_fs_drbd1 on fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: RecurringOp: Start recurring monitor (120s) for mysql on fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: RecurringOp: Start recurring monitor (60s) for postfix on fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: RecurringOp: Start recurring monitor (30s) for virtual_ip on fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: RecurringOp: Start recurring monitor (15s) for drbdproxy on fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: RecurringOp: Start recurring monitor (29s) for p_drbd_r0:0 on fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: RecurringOp: Start recurring monitor (29s) for p_drbd_r0:0 on fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: RecurringOp: Start recurring monitor (29s) for p_drbd_r0-L:0 on fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_r0-L:1 on fc-ha2
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: RecurringOp: Start recurring monitor (29s) for p_drbd_r0-L:0 on fc-ha1
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: RecurringOp: Start recurring monitor (31s) for p_drbd_r0-L:1 on fc-ha2
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: LogActions: Leave p_fs_drbd1 (Started fc-ha1)
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: LogActions: Leave mysql (Started fc-ha1)
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: LogActions: Leave postfix (Started fc-ha1)
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: LogActions: Leave virtual_ip (Started fc-ha1)
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: LogActions: Leave drbdproxy (Started fc-ha1)
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: LogActions: Leave p_drbd_r0:0 (Master fc-ha1)
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: LogActions: Leave p_drbd_r0-L:0 (Master fc-ha1)
Oct 09 18:34:26 [22127] fc-ha1 pengine: info: LogActions: Leave p_drbd_r0-L:1 (Slave fc-ha2)

Grepping the logs for LogActions is a good way to quickly see when a failover or resource recovery first started. You can see when the cluster running and expected and when the policy engine decided to take action. In the example below we see the cluster was happy at 19:36, but then something occured at 19:46:

# grep LogAction /var/log/cluster/corosync.log | less
[...]
Oct 09 19:36:58 [22127] fc-ha1 pengine: info: LogActions: Leave p_fs_drbd1 (Started fc-ha1)
Oct 09 19:36:58 [22127] fc-ha1 pengine: info: LogActions: Leave mysql (Started fc-ha1)
Oct 09 19:36:58 [22127] fc-ha1 pengine: info: LogActions: Leave postfix (Started fc-ha1)
Oct 09 19:36:58 [22127] fc-ha1 pengine: info: LogActions: Leave virtual_ip (Started fc-ha1)
Oct 09 19:36:58 [22127] fc-ha1 pengine: info: LogActions: Leave drbdproxy (Started fc-ha1)
Oct 09 19:36:58 [22127] fc-ha1 pengine: info: LogActions: Leave p_drbd_r0:0 (Master fc-ha1)
Oct 09 19:36:58 [22127] fc-ha1 pengine: info: LogActions: Leave p_drbd_r0-L:0 (Master fc-ha1)
Oct 09 19:36:58 [22127] fc-ha1 pengine: info: LogActions: Leave p_drbd_r0-L:1 (Slave fc-ha2)
Oct 09 19:46:13 [22127] fc-ha1 pengine: notice: LogAction: * Stop p_fs_drbd1 ( fc-ha1 ) due to node availability
Oct 09 19:46:13 [22127] fc-ha1 pengine: notice: LogAction: * Stop mysql ( fc-ha1 ) due to node availability
Oct 09 19:46:13 [22127] fc-ha1 pengine: notice: LogAction: * Stop postfix ( fc-ha1 ) due to node availability
Oct 09 19:46:13 [22127] fc-ha1 pengine: notice: LogAction: * Stop virtual_ip ( fc-ha1 ) due to node availability
Oct 09 19:46:13 [22127] fc-ha1 pengine: notice: LogAction: * Stop drbdproxy ( fc-ha1 ) due to node availability
Oct 09 19:46:13 [22127] fc-ha1 pengine: notice: LogAction: * Demote p_drbd_r0:0 ( Master -> Slave fc-ha1 )
Oct 09 19:46:13 [22127] fc-ha1 pengine: info: LogActions: Leave p_drbd_r0-L:0 (Master fc-ha1)
Oct 09 19:46:13 [22127] fc-ha1 pengine: info: LogActions: Leave p_drbd_r0-L:1 (Slave fc-ha2)

crmd

The crmd is the cluster resource management daemon. The crmd is responsible for coordinating actions across the cluster. The more interesting crmd logs will almost always exist on the DC (designated controller node). The crmd is also responsible for the DC elections.

Oct 09 17:38:18 [17981] fc-ha1 crmd: notice: te_rsc_command: Initiating monitor operation mysql_monitor_0 locally on fc-ha1 | action 3

When troubleshooting a cluster the crmd logs are often useful to show when a particular cluster node has left or joined the cluster.

Oct 09 17:38:18 [17981] fc-ha1 crmd: info: join_make_offer: join-1: Sending offer to fc-ha1
Oct 09 17:38:18 [17981] fc-ha1 crmd: info: crm_update_peer_join: join_make_offer: Node fc-ha1[1] - join-1 phase none -> welcomed
Oct 09 17:38:18 [17981] fc-ha1 crmd: info: join_make_offer: Not making an offer to fc-ha2: not active ((null))
Oct 09 17:38:18 [17981] fc-ha1 crmd: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
Oct 09 17:38:18 [17981] fc-ha1 crmd: warning: do_log: Input I_ELECTION_DC received in state S_INTEGRATION from do_election_check
Oct 09 17:38:18 [17981] fc-ha1 crmd: info: crm_update_peer_join: initialize_join: Node fc-ha1[1] - join-2 phase welcomed -> none
Oct 09 17:38:18 [17981] fc-ha1 crmd: info: join_make_offer: join-2: Sending offer to fc-ha1
Oct 09 17:38:18 [17981] fc-ha1 crmd: info: crm_update_peer_join: join_make_offer: Node fc-ha1[1] - join-2 phase none -> welcomed
Oct 09 17:38:18 [17981] fc-ha1 crmd: info: join_make_offer: Not making an offer to fc-ha2: not active ((null))
Oct 09 17:38:18 [17981] fc-ha1 crmd: info: do_dc_join_offer_all: join-2: Waiting on 1 outstanding join acks
Oct 09 17:38:18 [17981] fc-ha1 crmd: info: update_dc: Set DC to fc-ha1 (3.0.14)
Oct 09 17:38:18 [17981] fc-ha1 crmd: info: crm_update_peer_expected: update_dc: Node fc-ha1[1] - expected state is now member (was (null))

lrmd

The local resource management daemon is responsible for the actual starting, stopping, and monitoring of resources local to a node. When troubleshooting issues with a cluster resources this is often the most useful logs. The lrmd logs will often include standard output of the commands called from the resource agent. This is often invaluable for troubleshooting. Below is an example of a DRBD® resource that timed out on a demote operation. The lrmd clearly shows us that DRBD refused to demote because the /dev/drbd device was still in use.

Oct 15 01:36:50 [22125] fc-ha1 lrmd: warning: child_timeout_callback: p_drbd_r0_demote_0 process (PID 22197) timed out
Oct 15 01:36:50 [22125] fc-ha1 lrmd: warning: operation_finished: p_drbd_r0_demote_0:22197 - timed out after 90000ms
Oct 15 01:36:50 [22125] fc-ha1 lrmd: notice: operation_finished: p_drbd_r0_demote_0:22197:stderr [ 1: State change failed: (-12) Device is held open by someone ]
Oct 15 01:36:50 [22125] fc-ha1 lrmd: notice: operation_finished: p_drbd_r0_demote_0:22197:stderr [ Command 'drbdsetup-84 secondary 1' terminated with exit code 11 ]

 

Reviewed 2021/01/27 - DJV