DRBD Stuck in `WFBitMapS` or `WFBitMapT` Transient States

This article helps you identify why DRBD® gets stuck in the `WFBitMapS` or `WFBitMapT` transient states while connecting.

Example `cat /proc/drbd` output:

1: cs:WFBitMapS ro:Primary/Secondary ds:UpToDate/Consistent A r-----
    ns:0 nr:0 dw:262651028 dr:53029299 al:390800 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:59040040
        resync: used:0/61 hits:0 misses:0 starving:0 locked:0 changed:0
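
Because these states are transient, a single reading does not prove that DRBD is stuck. The sketch below filters status text for connection states of `WFBitMapS` or `WFBitMapT` so the check can be repeated a few seconds apart. It assumes a DRBD version that reports resource state in `/proc/drbd`, as in the example above; the function name `find_bitmap_wait_states` is made up for illustration.

```shell
#!/bin/bash
# Print the status lines whose connection state (cs:) is WFBitMapS or WFBitMapT.
# Reads the text to scan from standard input; returns non-zero if none match.
find_bitmap_wait_states() {
    grep -E 'cs:WFBitMap[ST]'
}

# Only read /proc/drbd when it exists on this node.
if [ -r /proc/drbd ]; then
    find_bitmap_wait_states < /proc/drbd \
        || echo "No resource is in WFBitMapS or WFBitMapT."
fi
```

If the same resource still shows up after several runs, it is likely stuck rather than passing through the state.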

DRBD most often gets stuck in these states because of mismatched MTU sizes. To check for this, run the `max_mtu_size.sh` script that follows, passing it the IP address of the peer's replication link, or manually check each hop along the replication path for mismatched MTU sizes:

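#!/bin/bash
# max_mtu_size.sh: find the largest MTU that reaches the given IP address unfragmented.

if [ -z "$1" ]; then
    echo "Usage: $0 <peer-replication-ip>" >&2
    exit 1
fi

LOW=100
HIGH=70000
TMP=0

# 20-byte IPv4 header + 8-byte ICMP header (ping uses ICMP, not TCP, despite the name)
PING_TCP_HEADER_SIZE=28
MAX_TRIES=33 # 2*log2($HIGH + HEADER - 1)

printf "\nSeeing if we can ping %s...\n\n" "$1"
if ! ping -c3 -i0.2 "$1"; then
    printf "\nCan't ping %s at all.\n\n" "$1"
    exit 1
fi

for ((i = 0; i < MAX_TRIES; i++ )); do
    # -Mdo sets the don't-fragment flag, so oversized packets fail instead of being fragmented
    if ! ping -i0.2 -c3 -w1 -Mdo -s $HIGH "$1" &> /dev/null; then
        # HIGH did not get through: remember it and try halfway between LOW and HIGH
        TMP=$HIGH
        HIGH=$(( ( HIGH + LOW ) / 2 ))
        echo "$(( TMP + PING_TCP_HEADER_SIZE )) is too high, trying $(( HIGH + PING_TCP_HEADER_SIZE ))"
    else
        if [ $HIGH -eq $LOW ]; then
            echo -e "\nLargest MTU size is $(( HIGH + PING_TCP_HEADER_SIZE )).\nPayload: $HIGH bytes.\tHeader: $PING_TCP_HEADER_SIZE bytes."
            exit 0
        fi
        # HIGH got through: raise LOW and retry just below the last size that failed
        LOW=$HIGH
        HIGH=$((TMP - 1))
        echo "$(( LOW + PING_TCP_HEADER_SIZE )) might be too low, trying $(( HIGH + PING_TCP_HEADER_SIZE ))"
    fi
done

echo "Failed to find max MTU size"
exit 1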

The script above does a binary search, sending pings with the don't-fragment flag set, to find the largest MTU that can be used when communicating with the IP address passed to it.
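
For a manual check of one specific MTU, the same arithmetic applies: the ping payload is the MTU minus 28 bytes (a 20-byte IPv4 header plus an 8-byte ICMP header). The following is a minimal sketch, not part of the original script; the function names, the `check_mtu.sh` file name, and the peer address in the usage comment are hypothetical.

```shell
#!/bin/bash
# IPv4 header (20 bytes) + ICMP echo header (8 bytes).
HEADER_SIZE=28

# Convert an MTU into the payload size to pass to `ping -s`.
payload_for_mtu() {
    echo $(( $1 - HEADER_SIZE ))
}

# Send three unfragmentable pings sized to exactly fill the given MTU.
# Returns 0 if a reply arrives, non-zero otherwise.
check_mtu() {
    local peer=$1 mtu=$2
    ping -c3 -i0.2 -w1 -M do -s "$(payload_for_mtu "$mtu")" "$peer" > /dev/null 2>&1
}

# Usage (hypothetical peer address): ./check_mtu.sh 192.0.2.10 9000
if [ "$#" -eq 2 ]; then
    if check_mtu "$1" "$2"; then
        echo "MTU $2 passes end to end."
    else
        echo "MTU $2 does not pass (payload $(payload_for_mtu "$2") bytes)."
    fi
fi
```

For example, an MTU of 1500 corresponds to a 1472-byte payload, and an MTU of 9000 to an 8972-byte payload.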

If the largest packet size that gets through is smaller than the MTU configured on the NICs used by the replication links, something in the network between the nodes is dropping or truncating larger packets, which prevents DRBD® from completing its handshake.
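
The comparison described above can be sketched as follows, assuming a Linux node where the interface MTU is readable from `/sys/class/net/<interface>/mtu`. The function names, the `compare_mtu.sh` file name, and the interface name in the usage comment are hypothetical; the measured value is whatever largest MTU size the ping test reported.

```shell
#!/bin/bash
# Print the MTU configured on a network interface, read from sysfs (Linux).
configured_mtu() {
    cat "/sys/class/net/$1/mtu"
}

# Compare the configured MTU with the largest size that passed end to end.
# Prints a one-line verdict; returns 1 when the path carries less than the NIC allows.
compare_mtu() {
    local configured=$1 measured=$2
    if [ "$measured" -lt "$configured" ]; then
        echo "Path MTU $measured is below configured MTU $configured: check each hop."
        return 1
    fi
    echo "Path carries the configured MTU of $configured."
}

# Usage (hypothetical interface name): ./compare_mtu.sh eth1 1500
if [ "$#" -eq 2 ]; then
    compare_mtu "$(configured_mtu "$1")" "$2"
fi
```

Run the comparison on both nodes, because each side's replication NIC can be configured with a different MTU.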

Reviewed 2020/12/02 – DGT