This article focuses on how to select suitable hardware to match workload needs when using LINSTOR® and DRBD®. While LINSTOR hardware requirements are minimal to the point of being negligible, DRBD data replication will impact hardware resources such as disk, processor, memory, and network. Scalability, high availability, disaster recovery, RAID, storage device type, and performance expectations are also factors when selecting suitable hardware for your deployments. These and other related topics are discussed within this article.
Scalability and high availability architecture
A minimum LINSTOR and DRBD cluster requires three nodes, to ensure quorum and prevent data divergence (so-called split-brain scenarios). Two nodes are diskful, that is, they have attached physical storage that backs DRBD devices, either directly, or else underneath logical volumes. A third node can be diskless, holding no replication data itself, and acts purely as a tie-breaker. For cost savings, a diskless tie-breaker node does not need to have the same hardware specifications as the diskful nodes in your cluster. You might even use something as basic as a single-board computer, such as a Raspberry Pi.
Balancing performance and fault tolerance
More nodes provide more replicas and higher fault tolerance within the cluster, but they also increase hardware costs and replication traffic, which can impact application performance. Weigh the required resilience against budget and performance goals.
Storage devices and performance
By identifying the data read and write requirements of the applications and services that might depend on DRBD replicated data, you can choose suitable physical storage devices on which to store your data.
NVMe and SSD
Non-volatile memory express (NVMe) and solid-state drive (SSD) storage devices are characterized by extremely high IOPS and low latency. This makes NVMe and SSD drives ideal for databases or other I/O-intensive applications. These drives are more expensive than other types of storage devices, such as serial attached SCSI (SAS) and serial AT attachment (SATA) storage devices.
When using NVMe devices, PCIe lanes become an important consideration. If multiple NVMe devices share the same set of PCIe lanes, for example, behind a PCIe switch, the available bandwidth is divided among them, potentially creating a bottleneck. Additionally, when using DRBD, it is crucial to ensure that the NVMe storage devices, network interface controllers (NICs), and CPU are on the same non-uniform memory access (NUMA) node. Otherwise, buffer copies between NUMA nodes can negatively impact performance.
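One way to verify NUMA placement is to read the kernel's sysfs `numa_node` attribute for each device. The following is a minimal Python sketch; the `eth0` and `nvme0` device paths are placeholder examples, and the attribute reports -1 on platforms that do not expose NUMA affinity:

```python
from pathlib import Path

def numa_node(sysfs_device_path):
    """Return the NUMA node of a device, or None if unknown.

    Reads the kernel's sysfs "numa_node" attribute; a value of -1
    means the platform did not report NUMA affinity for the device.
    """
    node_file = Path(sysfs_device_path) / "numa_node"
    try:
        node = int(node_file.read_text().strip())
    except (OSError, ValueError):
        return None
    return node if node >= 0 else None

# Placeholder device names; substitute your replication NIC and
# backing NVMe drive, then keep DRBD and application threads on
# the shared node (for example, with numactl or taskset).
nic = numa_node("/sys/class/net/eth0/device")
disk = numa_node("/sys/class/nvme/nvme0/device")
if nic is not None and disk is not None and nic != disk:
    print("Warning: NIC and NVMe device are on different NUMA nodes")
```

On a real system, `numactl --hardware` shows the overall node topology, which you can cross-check against the values this sketch reports.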
SAS
SAS storage devices are characterized by better performance than SATA storage devices, but will have greater latency than NVMe or SSD storage devices. SAS storage devices are a good middle ground for cost and performance.
SATA
SATA hard disk drives are generally cheaper, larger-capacity spinning hard disk drives. They have much higher latency when compared to the other storage devices mentioned in this article, particularly NVMe and SSD storage devices. That said, they are still useful for archival or “cold” data. SATA disk drives are less ideal for workloads that are sensitive to I/O latency.
RAID considerations when using DRBD
Although DRBD already keeps fully replicated copies of data on multiple nodes, local RAID can still be beneficial. Each of the following RAID approaches has pros and cons.
No RAID (single storage device or JBOD)
The pros of not using RAID and just using a single storage device or else multiple storage devices underneath a single logical volume group are:
- Lower cost and reduced hardware complexity (no RAID controller needed)
- Straightforward configuration: DRBD replicates to each node’s storage device directly
- Simplified write path: no additional I/O due to RAID setup
The cons of not using RAID are:
- If a storage device fails, the DRBD resource on that node will transparently fail over by default (become “diskless”) to a healthy replica on another node. This is a disk-level or resource failover, not a full node failover.
- No local redundancy means reliance on cluster-level replication for resilience.
RAID 1 or RAID 10
Both RAID 1 and RAID 10 involve mirroring of data across multiple storage devices. The pros of this type of RAID setup are:
- Protects against single (or multiple) storage device failures on a node, without forcing DRBD resource failover
- Helps ensure local, node-level resilience in addition to DRBD cross-node replication
- Improves read performance, particularly when using RAID 10
The cons of using either of these RAID setups are:
- Increases hardware cost, because you have more storage devices and a RAID controller per node
- Increases setup complexity, particularly when using RAID 10
- Potentially decreases write performance, when compared to single-disk or JBOD setup
RAID 5 or RAID 6
Both RAID 5 and RAID 6 involve striping data, with parity, across multiple storage devices. The pros of this type of RAID setup are:
- Good capacity efficiency for a given number of storage devices
- Tolerates one (RAID 5) or two (RAID 6) disk failures on a local node
The cons of using either of these RAID setups are:
- The write penalty for performing parity calculations can significantly affect I/O and CPU performance.
- Latency-sensitive workloads may suffer under synchronous DRBD replication.
- You need to consider DRBD metadata when using parity RAID levels.
HBA passthrough
When using any host bus adapter (HBA) passthrough, for example, to allow virtual machines (VMs) direct access to host storage, there are some considerations when also using DRBD:
- Some setups might prefer to use HBA passthrough mode (no hardware RAID) and let DRBD (or software RAID) handle data.
- Using HBA passthrough simplifies the stack but you lose local disk redundancy unless you configure software RAID.
Even with these considerations, DRBD still provides full data replication across nodes.
CPU considerations when using DRBD
The DRBD User Guide documents DRBD hardware requirements. DRBD handles checksum, transport protocol, and other kernel-level operations related to data replication. DRBD itself typically has little impact on CPU usage. However, TCP/IP data replication over the network involves the CPU and can place demands on it that create bottlenecks, depending on I/O load and bandwidth. Particularly with synchronous replication (Protocol C), CPU bottlenecks can limit throughput and increase latency.
CPU clock speed and CPU core count
Two aspects of CPUs, clock speed and core count, are factors to consider when assessing CPU purchases when using DRBD. A higher CPU clock speed can help single or lower-thread replication paths. A higher CPU core count can be useful if you have many DRBD resources, for example, for hypervisor-based workloads, or highly parallel workloads. Modern Intel Xeon or AMD EPYC processors offer a good mix of frequency and core count.
Memory usage considerations when using DRBD
DRBD buffers replication data in the kernel memory space. LINSTOR uses memory for its management processes, the LINSTOR controller and satellite services running on nodes. LINSTOR memory use is modest. In hyperconverged setups, where nodes run other platforms, VMs, or containers, in addition to providing storage, demand on system memory increases significantly.
Recommended memory allocation
A practical minimum amount of memory is 64 GB of RAM per storage node, especially for production or larger volumes. A general rule is that you need about 32 MiB of RAM per 1 TiB of DRBD storage, per peer. If you have ten DRBD resources, where each is 100 GiB replicated between two peers, you need at least 64 MiB of memory for DRBD replication alone. DRBD supports a maximum device size of 1 PiB (1024 TiB) per resource. If you are using LINSTOR and DRBD in a hyperconverged environment, consider significantly more memory (256 GB to 1024 GB) to comfortably accommodate application workloads and DRBD memory requirements.
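As a quick check, the rule of thumb above reduces to a few lines of arithmetic. The following sketch is an estimate only, not an exact accounting of DRBD kernel memory:

```python
def drbd_memory_mib(total_tib, peers, mib_per_tib_per_peer=32):
    """Rough DRBD replication memory estimate, in MiB.

    Applies the rule of thumb of about 32 MiB of RAM per 1 TiB of
    DRBD-backed storage, per peer.
    """
    return total_tib * peers * mib_per_tib_per_peer

# Ten resources of 100 GiB each, replicated between two peers:
total_tib = 10 * 100 / 1024           # about 0.98 TiB
print(f"{drbd_memory_mib(total_tib, peers=2):.1f} MiB")  # prints: 62.5 MiB
```

Rounding the result up gives the "at least 64 MiB" figure used in this article.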
Network considerations when using DRBD
DRBD synchronous replication performance heavily depends on network throughput and latency. A low-bandwidth or high-latency link will bottleneck replicated writes and their acknowledgments.
Bandwidth requirements and bottlenecks
The following subsections discuss various aspects of network bandwidth and bottlenecks.
Throughput compared with local disk speed
A high-end single NVMe drive can sustain over 8 GB/s. A 10 GbE connection, in real terms, offers around 1.25 GB/s of throughput. If local disk speed exceeds network bandwidth, the network becomes a bottleneck for I/O when using DRBD to replicate data.
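This comparison is simple arithmetic: divide the link's line rate by 8 to get bytes per second, then compare against disk throughput. A small sketch, using the 8 GB/s NVMe figure from this article as the example value:

```python
def link_gbytes_per_sec(gbits_per_sec, efficiency=1.0):
    """Convert a network line rate in Gbit/s to GB/s of payload.

    Lower `efficiency` to account for protocol overhead; 1.0 is the
    raw line rate.
    """
    return gbits_per_sec / 8 * efficiency

nvme_gbs = 8.0  # example high-end NVMe sequential throughput, GB/s
for rate in (10, 25, 40, 100):
    net_gbs = link_gbytes_per_sec(rate)
    limited = "network-limited" if net_gbs < nvme_gbs else "disk-limited"
    print(f"{rate:>3} GbE ~ {net_gbs:.2f} GB/s -> {limited}")
```

With these numbers, only a 100 GbE link keeps the network from being the bottleneck for a single high-end NVMe drive.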
Scaling beyond 10 GbE
To take full advantage of NVMe performance, consider 25, 40, or 100 GbE network connections between DRBD replicating nodes. However, because faster network infrastructure (NICs, switches, and so on) is more expensive, weigh the costs against the benefits carefully.
Latency sensitivity
Synchronous DRBD writes require acknowledgment from remote nodes. Even with sufficient bandwidth, high latency can degrade performance. Low-latency network hardware, such as fiber-based links with a minimal hop count, is recommended. When this is not possible, consider an asynchronous DRBD replication mode.
Redundancy and bonding
Network bonding, for example, by using the link aggregation control protocol (LACP), can provide resilience and combine multiple links to increase bandwidth.
DRBD also has an option for load balancing TCP/IP replication traffic across multiple links. Consistent packet delivery and minimal jitter and latency are crucial when configuring redundancy and network bonding.
RDMA
With TCP/IP, you can begin to experience system performance degradation at data transfer speeds of around 10 gigabits per second (Gbps), and the degradation becomes more severe as speeds increase. If you are experiencing this type of performance degradation, or anticipate doing so, you might consider using an alternative transport protocol, RDMA.
RDMA transfers data directly between the physical memory of two systems over high-speed networks, for example, InfiniBand or RoCE, bypassing CPU and operating system involvement. This reduces latency and CPU usage, boosting throughput and efficiency for performance-critical storage or clustering environments.
Implementing RDMA requires dedicated and specialized hardware. Again, you need to consider the costs and benefits before purchasing equipment.
Disaggregated storage considerations
If you are using a disaggregated storage design, where hosts connect to separate storage nodes, you need to consider the following.
Each DRBD resource mirrors data to at least two diskful storage nodes. In a disaggregated storage design, your compute hosts send every write operation over the network to each storage node that holds a replica, the so-called diskful nodes. (If you instead expose storage through some other interface, such as NFS, NVMe-oF, or iSCSI, hosts send writes only to the DRBD primary node, which then replicates the changes to secondary nodes.) If there is only a single 10GbE link connecting a host to the storage network, the same data block must travel over it twice in a typical 3-node cluster (2 diskful nodes + 1 diskless tie-breaker node): once for each diskful replica node. Because the 10GbE link has a fixed maximum throughput, about 1.25 GB/s in real-world terms, the available bandwidth is effectively shared or “split” among these simultaneous streams.
This means that you might need to consider throughput sharing and possible network bottlenecks.
- Throughput sharing: If you are saturating the link, each write path to each node can only use a portion of the total 10GbE bandwidth.
- Increased disk throughput and network bottlenecks: Even if your high-performance local disks, such as NVMe or SSD storage devices, can handle higher throughput, the DRBD replication traffic over a single 10GbE link might become the limiting factor.
To mitigate these concerns, you might consider implementing:
- Link bonding: Aggregate multiple 10GbE links, for example, by using LACP or enabling DRBD load balancing, to increase the total available bandwidth and provide redundancy on the hosts.
- Faster networks: Consider moving to 25GbE, 40GbE, or 100GbE if disk performance and workload demands warrant it and the benefits justify the costs.
- Traffic separation: Use separate physical or logical networks to separate replication traffic from normal data traffic, reducing contention for bandwidth.
In short, because each write must be sent to two (or more) storage nodes, that 10GbE link is handling multiple parallel data streams of the same write operation, effectively reducing the per-stream bandwidth available if the link is near saturation. In practice, your “usable” bandwidth for each data copy can be halved. You should plan for enough aggregate bandwidth to avoid saturation if multiple DRBD resources or large writes occur in parallel.
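The per-replica share of a saturated host link can be approximated by dividing the link's byte rate by the number of diskful replicas each write must reach. A simple sketch, using this article's 3-node example with two diskful replicas:

```python
def per_replica_gbs(link_gbit, diskful_replicas):
    """Approximate per-replica write bandwidth, in GB/s, when a single
    host link carries one parallel replication stream to each diskful
    node and the link is saturated."""
    return link_gbit / 8 / diskful_replicas

# Single 10GbE host link, two diskful replicas:
print(f"{per_replica_gbs(10, 2):.3f} GB/s per replica")  # prints: 0.625 GB/s per replica
```

This is the "halved" usable bandwidth described above; adding a third diskful replica would cut the per-stream share further, to roughly 0.42 GB/s.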
Remote replication and other techniques for disaster recovery
DRBD replicates data synchronously by default. DRBD also supports asynchronous or semi-synchronous replication modes across geographically separated locations, typically used when creating a disaster recovery plan for your data. If the bandwidth or latency between locations is insufficient for fully synchronous writes, you have alternatives:
DRBD Proxy
A specialized component that buffers and optionally compresses DRBD traffic over slower or higher-latency links. It helps reduce the immediate write penalty for the primary site by buffering bursts of data and sending them asynchronously (or near-synchronously) to the remote location. DRBD Proxy is currently proprietary code, available only to LINBIT® customers.
Backup and snapshot shipping with LINSTOR
If bandwidth is limited or latency is too high for continuous replication, you can periodically create snapshots of LINSTOR-managed DRBD resources and transfer them off-site, by using snapshot (also called backup) shipping. Snapshot shipping requires that your LINSTOR resources are backed by LVM or ZFS thin-provisioned volumes. This is a different type of replication than standard DRBD replication, but it can be a viable approach for disaster recovery in environments with constrained network resources.
Asynchronous DRBD replication (protocol A)
By using the DRBD asynchronous replication protocol A, write operations are acknowledged locally before a DRBD peer node confirms the remote write. This reduces local I/O latency but introduces a risk of data loss if the primary site fails before remote replication is complete, because the peer might not have an up-to-date copy of the data.
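As an illustration, protocol A can be set in the `net` section of a DRBD resource configuration. The following is a minimal hand-written sketch: the resource name, host names, device paths, and addresses are placeholders, and LINSTOR-managed resources would normally set the protocol through LINSTOR's DRBD options rather than by editing configuration files directly.

```
resource r0 {
    net {
        protocol A;    # asynchronous: writes complete locally once the
                       # data reaches the local disk and TCP send buffer
    }
    on alpha {
        device    /dev/drbd0;
        disk      /dev/vg0/lv_r0;
        address   192.0.2.1:7789;
        meta-disk internal;
    }
    on bravo {
        device    /dev/drbd0;
        disk      /dev/vg0/lv_r0;
        address   192.0.2.2:7789;
        meta-disk internal;
    }
}
```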
Recommendations for specific workloads
Using LINSTOR and DRBD with certain application or platform-based workloads might require specific hardware considerations. Some of the more common use cases and recommendations for them are:
Databases
When using DRBD to replicate database data, particularly online transaction processing (OLTP) or online analytical processing (OLAP) driven database data, high IOPS and low latency are crucial. For this use case, NVMe or SSD storage is recommended. Also consider 25, 40, or 100 GbE network links to match disk performance. RAID 10 or local SSD RAID helps ensure local fault tolerance without immediate resource failover when a disk within the RAID fails.
Virtualization and containers
When using DRBD to replicate storage data that backs virtualized or containerized workloads, the random I/O patterns typical of this scenario will benefit from fast SSD or NVMe disks. Ensure 64 GB or more of memory per node, and more for hyperconverged environments. Using at least 10 GbE network links is recommended.
Archiving and backups
When using DRBD to create highly available archival or backup data, higher capacity takes precedence over higher performance. For this use case, you can use SAS or SATA hard disk drives as a cost-effective solution. Using RAID 6 or RAID 10 might strike a balance between capacity and local redundancy targets.
Network file shares
When using DRBD to create highly available network file shares, such as NFS, SMB, or CIFS, you will want moderate to high throughput, with some IOPS to spare for metadata operations. A 10GbE or faster network is recommended for multi-user environments. Implementing RAID 10 or SSD caching can boost performance for frequently accessed files.
Sample hardware architecture
To bring the recommendations in this article together, the following is a reference build for a typical small-to-medium sized LINSTOR and DRBD production setup:
Server
Each DRBD diskful node should have:
- CPU: 8+ core, 2.0+ GHz, for example, Intel Xeon or AMD EPYC
- Memory: Minimum 64 GB, more if hyperconverged, for example, 128 GB
- Disk:
- Dedicated SSD disk for operating system and LINSTOR management components
- For data, either RAID 10 with SSD, SAS, or NVMe disks, or else a single-disk JBOD configuration, depending on your local redundancy needs
- Network: Dual-port NIC at 10 GbE or higher, ideally bonded
- Cluster Size:
- At least three nodes for DRBD quorum. One node can be diskless to act as a tie-breaker. The tie-breaker node does not need to match the hardware specifications of the diskful nodes.
- You can add more nodes to increase capacity while taking performance scaling into consideration
- Redundancy:
- Use RAID 10 if you require local disk fault tolerance without immediate DRBD resource failover on a disk error.
- Single disk or JBOD is possible if cost is a bigger concern and immediate resource failover on disk failure is acceptable.
Written by YY, 2025-03-04.
Reviewed by MDK, 2025-03-05.