Tips for using ZFS and Zpools over DRBD with Pacemaker.

ZFS is a popular modern filesystem, and using it on top of DRBD requires some special considerations. This article covers the requirements for building a system with ZFS on top of DRBD, managed by Pacemaker.

Prerequisites

  • A DRBD® 9.x resource is created and started on all nodes.
  • Pacemaker and Corosync are configured and started on all nodes.
  • ZFS is installed (see the ZFS getting started documentation for your distribution).

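Before continuing, you can quickly verify each prerequisite. These optional checks assume the DRBD resource is named r0, as in the examples in this article, and that a reasonably recent OpenZFS release is installed (zfs version requires OpenZFS 0.8 or later):

# drbdadm status r0
# crm_mon -1
# zfs version
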
DRBD Specific Requirements for ZFS

To use DRBD underneath ZFS, you must disable DRBD 9's auto-promotion feature. ZFS does not hold the block device open in the kernel the way other Linux filesystems or processes do, and therefore will not cooperate with DRBD's auto-promote feature.

resource r0 {
  device    /dev/drbd0;
  disk      /dev/sdb;
  meta-disk internal;

  options {
    auto-promote no;
  }

  on zfs-0 {
    address   192.168.222.20:7777;
    node-id   0;
  }

  on zfs-1 {
    address   192.168.222.21:7777;
    node-id   1;
  }

  on zfs-2 {
    address   192.168.222.22:7777;
    node-id   2;
  }

  connection-mesh {
    hosts zfs-0 zfs-1 zfs-2;
  }
}

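If the resource is already up, as assumed in the prerequisites, one way to apply the new options block to the running resource and confirm that it took effect is:

# drbdadm adjust r0
# drbdsetup show r0
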
Also, if you plan to use multiple DRBD devices to create a zpool, you will want to use a multi-volume DRBD configuration, so that all of the pool's backing devices replicate over the same connection and are promoted and demoted together as a single resource.

resource r0 {
  options {
    auto-promote no;
  }

  volume 0 {
    device    minor 0;
    disk      /dev/sdb;
    meta-disk internal;
  }

  volume 1 {
    device    minor 1;
    disk      /dev/sdc;
    meta-disk internal;
  }

  on zfs-0 {
    address   192.168.222.20:7777;
    node-id   0;
  }

  on zfs-1 {
    address   192.168.222.21:7777;
    node-id   1;
  }

  on zfs-2 {
    address   192.168.222.22:7777;
    node-id   2;
  }

  connection-mesh {
    hosts zfs-0 zfs-1 zfs-2;
  }
}

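If you are adding a second volume to an already existing resource, rather than creating the resource from scratch, you would typically also create metadata for the new volume and adjust the running resource. A minimal sketch, assuming volume 1 was just added to r0:

# drbdadm create-md r0/1
# drbdadm adjust r0
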
Creating and Using zpools over DRBD

Once your DRBD resource is created, promote it on a single node and begin creating your zpool.

# drbdadm primary r0

# zpool create new-pool /dev/drbd0

# zpool status
  pool: new-pool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        new-pool    ONLINE       0     0     0
          drbd0     ONLINE       0     0     0

errors: No known data errors

You should now see that you have the ZFS filesystem mounted at /new-pool.

# mount | grep new-pool
new-pool on /new-pool type zfs (rw,xattr,noacl)

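If you configured a multi-volume DRBD resource as shown earlier, you can build the pool from several DRBD devices instead. The two-device, non-redundant layout below is only one possibility; choose a vdev layout that fits your needs:

# zpool create new-pool /dev/drbd0 /dev/drbd1
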
Test putting something into the mount point, and then export (unmount and stop) the zpool and demote the DRBD device.

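For example, you might copy a small file into the pool before exporting it (the file used here is arbitrary):

# cp /etc/hostname /new-pool/
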
# zpool export -f new-pool
# zpool status
no pools available

# mount | grep new-pool

# drbdadm secondary r0

Now the DRBD device can be promoted on a different node, and the zpool can be imported and used there. Importing with -o cachefile=none keeps the pool out of the ZFS cache file so that it will not be imported automatically at boot, which is what you want when imports are controlled manually or, later, by Pacemaker. You should see that whatever data you placed into /new-pool on the previous node has been replicated to all peers.

# drbdadm primary r0
# zpool import -o cachefile=none new-pool

# zpool status
  pool: new-pool
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        new-pool    ONLINE       0     0     0
          drbd0     ONLINE       0     0     0

errors: No known data errors

# mount | grep new-pool
new-pool on /new-pool type zfs (rw,xattr,noacl)
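
Any test data written on the first node, such as the file copied in the earlier example, should now be visible here as well:

# ls /new-pool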

Adding a DRBD backed zpool to Pacemaker

To make a zpool backed by DRBD a part of Pacemaker configuration you need to make sure the DRBD device backing the zpool gets started and promoted before the zpool is imported. You also need to make sure the zpool is imported on the node where DRBD has been promoted.

In Pacemaker specific terms, the zpool must be colocated with the DRBD primary, and ordered to start only after DRBD is promoted.

The following examples satisfy these requirements, first in crmsh syntax and then as the equivalent CIB XML for use with pcs:

primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op start interval=0s timeout=240 \
        op promote interval=0s timeout=90 \
        op demote interval=0s timeout=90 \
        op stop interval=0s timeout=100 \
        op monitor interval=29 role=Master \
        op monitor interval=31 role=Slave
primitive p_zfs ocf:heartbeat:ZFS \
        params pool=new-pool \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s \
        op monitor interval=20 timeout=40s
ms ms_drbd_r0 p_drbd_r0 \
        meta master-max=1 master-node-max=1 notify=true clone-node-max=1 clone-max=3
colocation cl_p_zfs-with-ms_drbd_r0 inf: p_zfs:Started ms_drbd_r0:Master
order o_ms_drbd_r0-before-p_zfs ms_drbd_r0:promote p_zfs:start

The equivalent configuration in CIB XML:

<resources>
  <primitive id="p_zfs" class="ocf" provider="heartbeat" type="ZFS">
    <instance_attributes id="p_zfs-instance_attributes">
      <nvpair name="pool" value="new-pool" id="p_zfs-instance_attributes-pool"/>
    </instance_attributes>
    <operations>
      <op name="start" interval="0" timeout="60s" id="p_zfs-start-0"/>
      <op name="stop" interval="0" timeout="60s" id="p_zfs-stop-0"/>
      <op name="monitor" interval="20" timeout="40s" id="p_zfs-monitor-20"/>
    </operations>
  </primitive>
  <master id="ms_drbd_r0">
    <meta_attributes id="ms_drbd_r0-meta_attributes">
      <nvpair name="master-max" value="1" id="ms_drbd_r0-meta_attributes-master-max"/>
      <nvpair name="master-node-max" value="1" id="ms_drbd_r0-meta_attributes-master-node-max"/>
      <nvpair name="notify" value="true" id="ms_drbd_r0-meta_attributes-notify"/>
      <nvpair name="clone-node-max" value="1" id="ms_drbd_r0-meta_attributes-clone-node-max"/>
      <nvpair name="clone-max" value="3" id="ms_drbd_r0-meta_attributes-clone-max"/>
    </meta_attributes>
    <primitive id="p_drbd_r0" class="ocf" provider="linbit" type="drbd">
      <instance_attributes id="p_drbd_r0-instance_attributes">
        <nvpair name="drbd_resource" value="r0" id="p_drbd_r0-instance_attributes-drbd_resource"/>
      </instance_attributes>
      <operations>
        <op name="start" interval="0s" timeout="240" id="p_drbd_r0-start-0s"/>
        <op name="promote" interval="0s" timeout="90" id="p_drbd_r0-promote-0s"/>
        <op name="demote" interval="0s" timeout="90" id="p_drbd_r0-demote-0s"/>
        <op name="stop" interval="0s" timeout="100" id="p_drbd_r0-stop-0s"/>
        <op name="monitor" interval="29" role="Master" id="p_drbd_r0-monitor-29"/>
        <op name="monitor" interval="31" role="Slave" id="p_drbd_r0-monitor-31"/>
      </operations>
    </primitive>
  </master>
</resources>
<constraints>
  <rsc_colocation id="cl_p_zfs-with-ms_drbd_r0" score="INFINITY" rsc="p_zfs" rsc-role="Started" with-rsc="ms_drbd_r0" with-rsc-role="Master"/>
  <rsc_order id="o_ms_drbd_r0-before-p_zfs" first="ms_drbd_r0" first-action="promote" then="p_zfs" then-action="start"/>
</constraints>
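
One possible way to commit these changes, assuming the crmsh snippet above has been saved to a file named zfs-drbd.crm (an arbitrary name used only for this example):

# crm configure load update zfs-drbd.crm

For a pcs based cluster, one common approach is to dump the current CIB to a file, merge the <resources> and <constraints> snippets above into it, and push the result back:

# pcs cluster cib > zfs-drbd.xml
# pcs cluster cib-push zfs-drbd.xml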

After configuring and committing your changes to Pacemaker, you should have a simple ZFS failover cluster.

# crm_mon -1rD
Node List:
  * Online: [ zfs-0 zfs-1 zfs-2 ]

Full List of Resources:
  * p_zfs       (ocf::heartbeat:ZFS):    Started zfs-2
  * Clone Set: ms_drbd_r0 [p_drbd_r0] (promotable):
    * Masters: [ zfs-2 ]
    * Slaves: [ zfs-0 zfs-1 ]

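To verify failover, you could place the node currently hosting the resources into standby, confirm that the zpool and the DRBD Master move to another node, and then bring the node back online. The node name below matches the example output above; pcs users can use pcs node standby and pcs node unstandby instead:

# crm node standby zfs-2
# crm_mon -1rD
# crm node online zfs-2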

Last Reviewed by MDK – 4/20/22