DRBD Stuck in WFBitMapS and/or WFBitMapT

This article can help you identify why DRBD is getting stuck while connecting in the WFBitMapS or WFBitMapT transient states.

Example `cat /proc/drbd` output:

1: cs:WFBitMapS ro:Primary/Secondary ds:UpToDate/Consistent A r-----
    ns:0 nr:0 dw:262651028 dr:53029299 al:390800 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:59040040
        resync: used:0/61 hits:0 misses:0 starving:0 locked:0 changed:0
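
To spot this condition without reading the output by eye, a minimal sketch along the following lines could be used. It assumes the `/proc/drbd` layout shown above (a device line of the form `<minor>: cs:<state> ...`), and the function name `find_wfbitmap_minors` is made up for illustration.

```shell
#!/bin/bash
# Sketch only: list DRBD minors whose connection state is WFBitMapS or WFBitMapT.
# Assumes device lines look like " 1: cs:WFBitMapS ro:Primary/Secondary ...".

# Reads /proc/drbd-style text on stdin and prints "<minor> <state>" per match.
find_wfbitmap_minors() {
    sed -n -E 's/^[[:space:]]*([0-9]+):[[:space:]]+cs:(WFBitMap[ST])([[:space:]].*)?$/\1 \2/p'
}
```

On a node, it would be called as `find_wfbitmap_minors < /proc/drbd`. A device that still appears after several seconds is likely stuck rather than passing through the state normally.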

Getting stuck in these states is most often caused by mismatched MTU sizes. To check for this, run the following max_mtu_size.sh script, passing the peer's replication link IP address as its only argument, or manually check each hop along the replication path for mismatched MTUs:

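#!/bin/bash
# max_mtu_size.sh: binary search for the largest MTU usable toward a peer.
# Usage: ./max_mtu_size.sh <peer replication IP>

if [ -z "$1" ]; then
    echo "Usage: $0 <peer replication IP>" >&2
    exit 1
fi

LOW=100     # largest payload assumed to get through
HIGH=70000  # payload size to try next
TMP=0       # smallest payload known to fail

# ping -s sets only the ICMP payload; the packet on the wire also carries
# a 20-byte IPv4 header and an 8-byte ICMP header.
PING_HEADER_SIZE=28
MAX_TRIES=33 # 2*log2($HIGH + HEADER - 1)

printf "\nSeeing if we can ping %s...\n\n" "$1"
if ! ping -c3 -i0.2 "$1"; then
    printf "\nCan't ping %s at all.\n\n" "$1"
    exit 1
fi

for ((i = 0; i < MAX_TRIES; i++ )); do
    # -Mdo sets the Don't Fragment bit, so oversized packets fail
    # instead of being silently fragmented.
    if ! ping -i0.2 -c3 -w1 -Mdo -s "$HIGH" "$1" &> /dev/null; then
        TMP=$HIGH
        HIGH=$(( ( HIGH + LOW ) / 2 ))
        echo "$(( TMP + PING_HEADER_SIZE )) is too high, trying $(( HIGH + PING_HEADER_SIZE ))"
    else
        if [ "$HIGH" -eq "$LOW" ]; then
            echo -e "\nLargest MTU size is $(( HIGH + PING_HEADER_SIZE )).\nPayload: $HIGH bytes.\tHeader: $PING_HEADER_SIZE bytes."
            exit 0
        fi
        LOW=$HIGH
        HIGH=$((TMP - 1))
        echo "$(( LOW + PING_HEADER_SIZE )) might be too low, trying $(( HIGH + PING_HEADER_SIZE ))"
    fi
done

echo "Failed to find max MTU size"
exit 1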

The script above does a binary search to find the largest MTU that can be used when communicating with the IP passed to it.
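
The 28-byte header figure the script adds to each payload is the 20-byte IPv4 header (without options) plus the 8-byte ICMP echo header. As a small worked example of that arithmetic, the sketch below converts between an MTU and the matching `ping -s` payload; the function and variable names are illustrative and not part of the script.

```shell
#!/bin/bash
# Sketch only: convert between an MTU and the ping payload that exactly fills it.
# IPv4 header without options (20 bytes) + ICMP echo header (8 bytes).
ICMP_IPV4_OVERHEAD=28

# Largest ping payload (-s value) that fits in the given MTU.
payload_for_mtu() {
    echo $(( $1 - ICMP_IPV4_OVERHEAD ))
}

# MTU implied by the largest payload that got through unfragmented.
mtu_for_payload() {
    echo $(( $1 + ICMP_IPV4_OVERHEAD ))
}

payload_for_mtu 1500   # prints 1472
payload_for_mtu 9000   # prints 8972
```

For a quick manual spot check of a standard 1500-byte MTU, something like `ping -c3 -M do -s 1472 192.0.2.10` should succeed, where 192.0.2.10 is a placeholder for the peer's replication address.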

If the largest packet size the script reports is smaller than the MTU configured on the NICs used for the replication link, something in the network between the nodes is dropping or truncating the larger packets, which prevents DRBD® from completing its handshake.
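
The comparison described above can be sketched as follows. This assumes a Linux host, where the configured MTU of an interface can be read from `/sys/class/net/<interface>/mtu`; the function names and the interface name used in the usage note are hypothetical.

```shell
#!/bin/bash
# Sketch only: compare a NIC's configured MTU with the measured path MTU.

# Prints the MTU configured on an interface, read from Linux sysfs.
configured_mtu() {
    cat "/sys/class/net/$1/mtu"
}

# Args: <configured MTU> <measured path MTU>.
# Returns 0 when the path carries packets as large as the NIC will send.
check_path_mtu() {
    local configured=$1 measured=$2
    if (( measured < configured )); then
        echo "MISMATCH: path allows $measured bytes, NIC is configured for $configured"
        return 1
    fi
    echo "OK: path MTU $measured covers configured MTU $configured"
}
```

For example, if the replication link uses an interface called `eth1` and the script reported a largest MTU of 1500, running `check_path_mtu "$(configured_mtu eth1)" 1500` would report a mismatch on a NIC configured for jumbo frames (MTU 9000).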

 

Reviewed 2020/12/02 – DGT