Tips for using ZFS and Zpools over DRBD with Pacemaker.

ZFS is a popular modern filesystem, and using it on top of DRBD requires some special considerations. This article covers the requirements for building a system with ZFS on top of DRBD, managed by Pacemaker.

Prerequisites

  • A DRBD® 9.x resource is created and started on all nodes.
  • Pacemaker and Corosync are configured and started on all nodes.
  • ZFS is installed (see the ZFS getting started documentation for your distribution).

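Before continuing, you can quickly verify each prerequisite. These optional checks assume the DRBD resource is named r0, as in the examples in this article, and that a reasonably recent OpenZFS release is installed (zfs version requires OpenZFS 0.8 or later):

# drbdadm status r0
# crm_mon -1
# zfs version
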
DRBD Specific Requirements for ZFS

To use DRBD underneath ZFS, you must disable DRBD 9's auto-promotion feature. ZFS does not hold the block device open in the kernel the way other Linux filesystems or processes do, and therefore will not cooperate with DRBD's auto-promote feature.

resource r0 {
  device    /dev/drbd0;
  disk      /dev/sdb;
  meta-disk internal;

  options {
    auto-promote no;
  }

  on zfs-0 {
    address   192.168.222.20:7777;
    node-id   0;
  }

  on zfs-1 {
    address   192.168.222.21:7777;
    node-id   1;
  }

  on zfs-2 {
    address   192.168.222.22:7777;
    node-id   2;
  }

  connection-mesh {
    hosts zfs-0 zfs-1 zfs-2;
  }
}

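If the resource is already up, as assumed in the prerequisites, one way to apply the new options block to the running resource and confirm that it took effect is:

# drbdadm adjust r0
# drbdsetup show r0
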
Also, if you plan to use multiple DRBD devices to create a zpool, you will want to use a multi-volume DRBD configuration, so that all of the pool's backing devices replicate over the same connection and are promoted and demoted together as a single resource.

resource r0 {
  options {
    auto-promote no;
  }

  volume 0 {
    device    minor 0;
    disk      /dev/sdb;
    meta-disk internal;
  }

  volume 1 {
    device    minor 1;
    disk      /dev/sdc;
    meta-disk internal;
  }

  on zfs-0 {
    address   192.168.222.20:7777;
    node-id   0;
  }

  on zfs-1 {
    address   192.168.222.21:7777;
    node-id   1;
  }

  on zfs-2 {
    address   192.168.222.22:7777;
    node-id   2;
  }

  connection-mesh {
    hosts zfs-0 zfs-1 zfs-2;
  }
}

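If you are adding a second volume to an already existing resource, rather than creating the resource from scratch, you would typically also create metadata for the new volume and adjust the running resource. A minimal sketch, assuming volume 1 was just added to r0:

# drbdadm create-md r0/1
# drbdadm adjust r0
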
Creating and Using zpools over DRBD

Once your DRBD resource is created, promote it on a single node and begin creating your zpool.

# drbdadm primary r0

# zpool create new-pool /dev/drbd0

# zpool status
  pool: new-pool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        new-pool    ONLINE       0     0     0
          drbd0     ONLINE       0     0     0

errors: No known data errors

You should now see that you have the ZFS filesystem mounted at /new-pool.

# mount | grep new-pool
new-pool on /new-pool type zfs (rw,xattr,noacl)

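If you configured a multi-volume DRBD resource as shown earlier, you can build the pool from several DRBD devices instead. The two-device, non-redundant layout below is only one possibility; choose a vdev layout that fits your needs:

# zpool create new-pool /dev/drbd0 /dev/drbd1
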
Test putting something into the mount point, and then export (unmount and stop) the zpool and demote the DRBD device.

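For example, you might copy a small file into the pool before exporting it (the file used here is arbitrary):

# cp /etc/hostname /new-pool/
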
# zpool export -f new-pool
# zpool status
no pools available

# mount | grep new-pool

# drbdadm secondary r0

Now the DRBD device can be promoted on a different node, and the zpool can be imported and used there. Importing with -o cachefile=none keeps the pool out of the ZFS cache file so that it will not be imported automatically at boot, which is what you want when imports are controlled manually or, later, by Pacemaker. You should see that whatever data you placed into /new-pool on the previous node has been replicated to all peers.

# drbdadm primary r0
# zpool import -o cachefile=none new-pool

# zpool status
  pool: new-pool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        new-pool    ONLINE       0     0     0
          drbd0     ONLINE       0     0     0

errors: No known data errors

# mount | grep new-pool
new-pool on /new-pool type zfs (rw,xattr,noacl)
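
Any test data written on the first node, such as the file copied in the earlier example, should now be visible here as well:

# ls /new-pool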

Adding a DRBD backed zpool to Pacemaker

To make a zpool backed by DRBD a part of Pacemaker configuration you need to make sure the DRBD device backing the zpool gets started and promoted before the zpool is imported. You also need to make sure the zpool is imported on the node where DRBD has been promoted.

In Pacemaker specific terms, the zpool must be colocated with the DRBD primary, and ordered to start only after DRBD is promoted.

The following examples satisfy these requirements, first in crmsh syntax and then as the equivalent CIB XML for use with pcs:

primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op start interval=0s timeout=240 \
        op promote interval=0s timeout=90 \
        op demote interval=0s timeout=90 \
        op stop interval=0s timeout=100 \
        op monitor interval=29 role=Master \
        op monitor interval=31 role=Slave
primitive p_zfs ocf:heartbeat:ZFS \
        params pool=new-pool \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s \
        op monitor interval=20 timeout=40s
ms ms_drbd_r0 p_drbd_r0 \
        meta master-max=1 master-node-max=1 notify=true clone-node-max=1 clone-max=3
colocation cl_p_zfs-with-ms_drbd_r0 inf: p_zfs:Started ms_drbd_r0:Master
order o_ms_drbd_r0-before-p_zfs ms_drbd_r0:promote p_zfs:start

The equivalent configuration in CIB XML:

<resources>
  <primitive id="p_zfs" class="ocf" provider="heartbeat" type="ZFS">
    <instance_attributes id="p_zfs-instance_attributes">
      <nvpair name="pool" value="new-pool" id="p_zfs-instance_attributes-pool"/>
    </instance_attributes>
    <operations>
      <op name="start" interval="0" timeout="60s" id="p_zfs-start-0"/>
      <op name="stop" interval="0" timeout="60s" id="p_zfs-stop-0"/>
      <op name="monitor" interval="20" timeout="40s" id="p_zfs-monitor-20"/>
    </operations>
  </primitive>
  <master id="ms_drbd_r0">
    <meta_attributes id="ms_drbd_r0-meta_attributes">
      <nvpair name="master-max" value="1" id="ms_drbd_r0-meta_attributes-master-max"/>
      <nvpair name="master-node-max" value="1" id="ms_drbd_r0-meta_attributes-master-node-max"/>
      <nvpair name="notify" value="true" id="ms_drbd_r0-meta_attributes-notify"/>
      <nvpair name="clone-node-max" value="1" id="ms_drbd_r0-meta_attributes-clone-node-max"/>
      <nvpair name="clone-max" value="3" id="ms_drbd_r0-meta_attributes-clone-max"/>
    </meta_attributes>
    <primitive id="p_drbd_r0" class="ocf" provider="linbit" type="drbd">
      <instance_attributes id="p_drbd_r0-instance_attributes">
        <nvpair name="drbd_resource" value="r0" id="p_drbd_r0-instance_attributes-drbd_resource"/>
      </instance_attributes>
      <operations>
        <op name="start" interval="0s" timeout="240" id="p_drbd_r0-start-0s"/>
        <op name="promote" interval="0s" timeout="90" id="p_drbd_r0-promote-0s"/>
        <op name="demote" interval="0s" timeout="90" id="p_drbd_r0-demote-0s"/>
        <op name="stop" interval="0s" timeout="100" id="p_drbd_r0-stop-0s"/>
        <op name="monitor" interval="29" role="Master" id="p_drbd_r0-monitor-29"/>
        <op name="monitor" interval="31" role="Slave" id="p_drbd_r0-monitor-31"/>
      </operations>
    </primitive>
  </master>
</resources>
<constraints>
  <rsc_colocation id="cl_p_zfs-with-ms_drbd_r0" score="INFINITY" rsc="p_zfs" rsc-role="Started" with-rsc="ms_drbd_r0" with-rsc-role="Master"/>
  <rsc_order id="o_ms_drbd_r0-before-p_zfs" first="ms_drbd_r0" first-action="promote" then="p_zfs" then-action="start"/>
</constraints>
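
One possible way to commit these changes, assuming the crmsh snippet above has been saved to a file named zfs-drbd.crm (an arbitrary name used only for this example):

# crm configure load update zfs-drbd.crm

For a pcs based cluster, one common approach is to dump the current CIB to a file, merge the <resources> and <constraints> snippets above into it, and push the result back:

# pcs cluster cib > zfs-drbd.xml
# pcs cluster cib-push zfs-drbd.xml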

After configuring and committing your changes to Pacemaker, you should have a simple ZFS failover cluster.

# crm_mon -1rD
Node List:
  * Online: [ zfs-0 zfs-1 zfs-2 ]

Full List of Resources:
  * p_zfs       (ocf::heartbeat:ZFS):    Started zfs-2
  * Clone Set: ms_drbd_r0 [p_drbd_r0] (promotable):
    * Masters: [ zfs-2 ]
    * Slaves: [ zfs-0 zfs-1 ]

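To verify failover, you could place the node currently hosting the resources into standby, confirm that the zpool and the DRBD Master move to another node, and then bring the node back online. The node name below matches the example output above; pcs users can use pcs node standby and pcs node unstandby instead:

# crm node standby zfs-2
# crm_mon -1rD
# crm node online zfs-2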

Last Reviewed by MDK – 4/20/22