Auditing TRIM Behavior Across File System, DRBD, and LVM Layers

Freeing storage space, or trimming, on solid-state drives (SSDs) or non-volatile memory express (NVMe) devices is important to reduce wear and tear and to improve device performance. The Linux fstrim file system trimming utility is often used to reclaim free space on TRIM-capable storage devices. Many Linux distributions ship with an enabled service that runs fstrim periodically. Where they do not, maintainers usually recommend periodic trimming.

When using fstrim, two concepts are important to remember. First, some file systems might have a concept of “stale state”, that is, whether there is any trimming for fstrim to do. Second, the discard granularity that the fstrim utility uses might not propagate correctly through the whole storage stack down to the physical back-end device. If not understood and mitigated, this can negatively affect DRBD® deployments, especially deployments running fleets of virtual machines, such as CloudStack, KubeVirt, Oracle Virtualization, Proxmox VE, Xen Orchestra, and others.

In this article, TRIM, discard, UNMAP, and deallocate are words used to describe a conceptually similar operation across a variety of physical storage devices (SCSI, SATA, SSD, NVMe, and others) that might support reclaiming free storage space. Familiarity with TRIM and TRIM-capable devices is assumed, but if you need a refresher, the Arch Linux wiki page on solid state drives is a good overview of the topic.

This article is organized into three sections. The first section describes TRIM behavior conceptually at each layer of the storage stack. The second section provides operational procedures for verifying and managing TRIM in a live deployment. The third section contains step-by-step demonstrations for reproducing the described behaviors in a test environment. Footnote references in the article collect detailed technical information and notes.

Understanding TRIM behavior across the storage stack

This section describes concepts and mechanisms related to TRIM behavior in a storage stack. It does not describe operational steps.

Potentially misleading TRIM reporting

When fstrim runs on a file system with the verbose (-v) flag, the tool reports how many bytes were submitted for deallocation. However, the reported number is not a confirmation that anything was actually deallocated on the physical storage device. It is the number of bytes passed down the block stack for potential discard, as the fstrim manual page says:

“This number is a maximum discard amount from the storage device’s perspective […] fstrim will report the same potential discard bytes each time but only sectors which had been written to between the discards would actually be discarded by the storage device.”

Every layer in the stack between the file system and the physical storage device can drop, adjust, or transform a discard request without returning an error to the caller. In one important case, a periodic trim call might generate a burst of zero-fill writes rather than discards.

The storage stack

A typical virtualized storage path with DRBD used for block-level replication looks like this:

File system (EXT4, XFS, ...)
        |
   DRBD
        |
   Logical volume (LVM thin, ZFS thin, or "thick" variants)
        |
   Actual block device (such as an SSD or NVMe device)

A TRIM (discard) request originates at the file system and must go through every layer. Each layer makes its own decision about what to do with it, and might have its own opinion on things such as discard granularity.

TRIM behavior in the file system layer

The file system determines which logical block address (LBA) ranges are free, and submits discard requests for them in response to the FITRIM ioctl, the kernel interface underlying fstrim. This is where the values reported by a verbose fstrim command come from. Again, they reflect what the file system submitted and not necessarily what happened in the underlying storage layers.

Beyond this, within the file system layer, different file systems might handle trimming differently. For example, EXT4 and XFS behave differently in an important way, as described in the following sections.

EXT4 stale TRIM state

EXT4 maintains an in-memory record of which free extents have been recently trimmed. On a later fstrim request, EXT4 skips those recently trimmed extents and submits nothing for them. If a user enters the later fstrim command with a verbose flag, the utility will report nothing trimmed (0 B trimmed). This is an efficiency measure: if nothing has changed, there is no point in re-submitting the same ranges for trimming.

However, a TRIM that runs immediately after blocks are freed might still submit nothing for those blocks. When blocks are allocated and then freed (files written and deleted) after a TRIM, EXT4 reflects the freed extents in its trim tracking only after the deallocation is committed to the journal, by an explicit sync or at the next periodic journal commit. A TRIM that runs before that commit submits nothing for those extents and reports 0 B trimmed for them, even though the blocks are already free from the perspective of applications.

This state is held in memory only and goes away after an unmount event.

On a freshly mounted EXT4 file system on top of a 1 GiB block device, with no prior trim state, the first fstrim submits all free space:

/mnt/test: 973.1 MiB (1020416000 bytes) trimmed

A second fstrim with no intervening file activity submits nothing:

/mnt/test: 0 B (0 bytes) trimmed

After allocating and freeing blocks without a preceding sync, EXT4 has not yet committed the deallocation to its journal, so a fstrim still submits nothing. After a sync that commits the deallocation, a fstrim recognizes that the affected extents need trimming.

/mnt/test: 111.2 MiB (116559872 bytes) trimmed

📝 NOTE: The output shows that more than 100 MiB was submitted for trimming even though only 100 MiB was freed. EXT4 tracks trim state per block group.¹ When a deallocation invalidates the trim state of a group, the next fstrim re-submits all free extents in that group, not only the changed range. The 100 MiB deallocation invalidated one or more groups, causing their entire free space to be re-submitted.

Unmounting and remounting the file system clears the in-memory trim state, causing the next fstrim to re-submit all free space. A reboot has the same effect. However, using mount -o remount does not clear the in-memory state.

Implications for periodic file system trimming

If fstrim.timer runs weekly (as it usually does or might be recommended to do) and no files have been created and deleted since the last run, EXT4 will report 0 B trimmed and submit nothing. This is correct and efficient. However, if the EXT4 stale state is intact while the underlying stack has changed, for example, after a DRBD reconfiguration that now zero-fills discards, a fstrim run that reports 0 B tells you nothing about how DRBD handles discards. It only means that EXT4 submitted no discards for DRBD to process.
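Because a 0 B report can be legitimate, it can help to record the byte count from each periodic run so that you can compare it with lower-level counters later. The following is a minimal sketch, not part of fstrim itself. The fstrim_bytes function name is hypothetical, and the parsing assumes the verbose output format shown in the earlier examples.

```shell
# Extract the submitted byte count from "fstrim -v" output read on stdin.
# Assumes lines of the form "<mount>: <size> (<N> bytes) trimmed".
fstrim_bytes() {
    sed -n 's/.*(\([0-9][0-9]*\) bytes) trimmed.*/\1/p'
}

# Usage, as root:
#   fstrim -v /mnt/test | fstrim_bytes
```

A result of 0 then means only that the file system submitted nothing, not that the layers below would have handled a discard correctly.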

XFS TRIM behavior

In contrast to EXT4 behavior, XFS does not maintain a “recently trimmed” state. Every fstrim call trims all current free space, regardless of whether it was trimmed before.

On a freshly formatted XFS file system, both the first and second fstrim calls submit all free space:

/mnt/test: 999 MiB (1047527424 bytes) trimmed

XFS always submits the full free space down the stack on every fstrim call. Compared with EXT4, this makes the risk of a zero-fill write storm, described in a later section, higher but also more predictable.

TRIM behavior in the DRBD layer

How DRBD behaves on receiving a discard request depends on configuration and on what the DRBD kernel module knows about the backing device.

DRBD requires that aligned discards on the backing device can be trusted to return zeros on readback. This is needed for consistency between DRBD peer nodes. If two peers apply a discard but only one backing device zeros the range, the two peers diverge. This divergence is unreported and a drbdadm status shows UpToDate/UpToDate on both nodes because DRBD considers the operation complete once both peers have acknowledged it.

Because the discarded range is free space from the perspective of the file system, the differing content is not corruption of live data. At the block level, however, the two DRBD replicas are no longer identical, and block-level identity is the consistency that DRBD is designed to maintain. DRBD can report this divergence during an online verification (drbdadm verify) as out-of-sync blocks, even though no “live” or file data is affected.

The DRBD disk option discard_zeroes_if_aligned tells DRBD to use an alignment-aware discard strategy, that is, DRBD zeros unaligned partial areas at the head and tail of a discard range itself, then passes aligned full-chunk discards through to the backing device.² By setting this option to yes, which is the default, an administrator asserts that aligned full-chunk discards on the backing device return zeros on readback, even if the backing device provides no kernel-visible guarantee of this behavior. When this option is set to no, or when the backing device does not support discard at all, DRBD falls back to calling blkdev_issue_zeroout across the entire range rather than passing any discard through.³

When the backing device has no discard support, this generates sequential write I/O across the full range submitted for discard. When discard_zeroes_if_aligned is no but the backing device does support discard, DRBD calls blkdev_issue_zeroout without the BLKDEV_ZERO_NOUNMAP flag, permitting the kernel to satisfy the range by using UNMAP rather than writes. The actual I/O type in that case depends on what zeroing mechanisms the backing device advertises. In either case, the caller receives a success return code and the file system has no indication that the operation differed from a normal discard. The storage node receives a full-range I/O workload before the request reaches the volume manager.

The default value of discard_zeroes_if_aligned is yes.⁴ With the default, DRBD uses the alignment-aware strategy and does not zero-fill the full range. Setting the option to no causes DRBD to handle every discard through blkdev_issue_zeroout, regardless of backing device capability. Whether that results in write I/O or UNMAP depends on the backing device, as described above.

📝 NOTE: rs_discard_granularity is a separate DRBD option that applies only during resynchronization, when the DRBD_FF_THIN_RESYNC protocol feature is negotiated between nodes.⁵ The option controls an optimization where, during resynchronization, the source node can tell the target node that a range is all zeros and have the target deallocate it, rather than sending the zeros across the network link for the target to rewrite. It does not affect how fstrim discards are handled during normal operation.

❗IMPORTANT: When using LVM under DRBD, do not disable LVM thin pool block zeroing by setting skip_block_zeroing=true. DRBD relies on discarded ranges reading back as zeros for the resynchronization optimization just described. With zeroing disabled, LVM does not clear a chunk when it re-provisions it, so a range that was zero, then discarded during resynchronization, can read back as stale data once a later partial write re-provisions the chunk. This corruption is unreported during normal operation, where drbdadm status shows UpToDate/UpToDate on both nodes. However, an online verification (drbdadm verify) will report the diverged ranges as out-of-sync, and a read served from the affected peer will return garbage where zeros are expected. Keep the default, skip_block_zeroing=false.

TRIM behavior in the volume manager layer

Both LVM thin provisioning and ZFS thin provisioning impose a minimum discard granularity. Requests smaller than that granularity are dropped without error before reaching the physical device. The following examples use LVM thin provisioning to illustrate this behavior. DRBD is often layered over LVM logical volumes but you can also layer it over ZFS volumes. ZFS thin provisioning has similar constraints, governed by volblocksize rather than the chunk size that LVM uses.

LVM thin provisioning only acts on discards that are aligned to and at least as large as its chunk size. Smaller or misaligned discards are dropped at the LVM layer without error and never reach the physical device below. The effective discard granularity seen by DRBD (and therefore the file system) is determined by whichever storage layer underneath DRBD has the coarsest granularity, that is, has the largest discard granularity value. Often this is the LVM chunk size, not the native granularity of the physical device.
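One way to find the coarsest layer is to list the discard granularity that each device in the stack advertises and take the largest value. The sketch below assumes that your lsblk supports the DISC-GRAN output column together with the --noheadings, --raw, and --bytes options. The coarsest_granularity function name and the /dev/sdX path are hypothetical.

```shell
# Read "<name> <granularity-in-bytes>" lines on stdin and print the entry
# with the largest granularity. On a tie, the last such entry is printed.
coarsest_granularity() {
    awk '$2 + 0 >= max { max = $2 + 0; name = $1 }
         END { if (NR > 0) print name, max }'
}

# Usage:
#   lsblk --noheadings --raw --bytes -o NAME,DISC-GRAN /dev/sdX | coarsest_granularity
```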

Linear (thick-provisioned) logical volumes do not have a chunk size. An lvs command will report 0 for chunk_size on a linear logical volume. This does not mean that discard is unsupported. Rather, it means that the chunk size concept does not apply. Linear logical volumes pass the discard geometry of the underlying physical volume upward unchanged, and do not themselves impose an additional granularity constraint. The thin provisioning discard behavior described here applies only when DRBD is backed by a thin-provisioned logical volume.

The lv_attr field from lvs at position 1 indicates the volume type: t is thin pool, V is thin volume, and - (hyphen) is a linear logical volume. A non-zero chunk_size value alongside a thin pool type is the granularity that limits discard pass-through.
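The mapping from the first lv_attr character to a volume type can be sketched as a small helper. This is an illustration only: the lv_type_from_attr function name is hypothetical, and it covers only the three types named above, reporting anything else as "other".

```shell
# Classify a logical volume from the first character of its lv_attr string.
lv_type_from_attr() {
    case "$1" in
        t*) echo "thin pool" ;;
        V*) echo "thin volume" ;;
        -*) echo "linear" ;;
        *)  echo "other" ;;
    esac
}

# Usage, as root:
#   lvs --noheadings -o lv_name,lv_attr,chunk_size | while read -r name attr chunk; do
#       echo "$name: $(lv_type_from_attr "$attr"), chunk size $chunk"
#   done
```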

DISC-GRAN in lsblk --discard output is distinct from the LVM chunk size. DISC-GRAN is the discard granularity advertised by the kernel block layer at that device node. If DISC-GRAN is 0, that layer does not advertise discard support to the layer above it. If DISC-GRAN is nonzero, the layer advertises discard support but requests smaller than that value will be dropped without error.

On an LVM thin volume backed by a 64 KiB chunk pool, lsblk --discard shows DISC-GRAN=64K on the thin volume device:

NAME                              DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
lvmtest                                  0      512B       4G         0
├─lvmtest_vg-lvmtest_pool_tmeta          0      512B       4G         0
│ └─lvmtest_vg-lvmtest_pool-tpool        0      512B       4G         0
│   ├─lvmtest_vg-lvmtest_pool            0      512B       4G         0
│   └─lvmtest_vg-lvmtest_vol             0       64K      64M         0
└─lvmtest_vg-lvmtest_pool_tdata          0      512B       4G         0
  └─lvmtest_vg-lvmtest_pool-tpool        0      512B       4G         0
    ├─lvmtest_vg-lvmtest_pool            0      512B       4G         0
    └─lvmtest_vg-lvmtest_vol             0       64K      64M         0

The tree shown in the output is deeper than the logical volume hierarchy suggests. LVM exposes the thin pool internal metadata device (_tmeta) and data device (_tdata) as separate device mapper targets, and lsblk renders the shared tpool target under both parents, producing duplicate subtrees. The relevant devices for this demonstration are lvmtest_vg-lvmtest_pool (the pool logical volume) and lvmtest_vg-lvmtest_vol (the thin volume).

The backing device and the pool logical volume both advertise DISC-GRAN=512B. Only the thin volume advertises DISC-GRAN=64K. The dm-thin target enforces chunk alignment at that device boundary before passing any discard down to the pool. DISC-MAX=64M on the thin volume is the maximum size the dm-thin target accepts for a single discard request; requests larger than this are split before being passed down.

The DISC-GRAN=64K minimum is a separate constraint: any discard request smaller than 64 KiB submitted to the thin volume will not reach the backing device.
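The arithmetic behind this constraint can be modeled in a few lines. The following sketch is a simplified model of the rule described here, not the dm-thin implementation. It rounds the start of a request up and the end down to chunk boundaries, and reports whether any whole chunk remains. The aligned_discard_range function name is hypothetical.

```shell
# Print the chunk-aligned part of a discard request as "<offset> <length>",
# or "none" if the request does not cover a whole chunk.
# $1 = offset, $2 = length, $3 = chunk size, all in bytes.
aligned_discard_range() {
    offset=$1
    length=$2
    chunk=$3
    start=$(( (offset + chunk - 1) / chunk * chunk ))  # round start up
    end=$(( (offset + length) / chunk * chunk ))       # round end down
    if [ "$end" -gt "$start" ]; then
        echo "$start $(( end - start ))"
    else
        echo "none"
    fi
}

# Usage:
#   aligned_discard_range 0 32768 65536     # 32 KiB request, 64 KiB chunks
#   aligned_discard_range 4096 131072 65536 # misaligned 128 KiB request
```

In this model, a 32 KiB request against 64 KiB chunks covers no whole chunk, and a misaligned 128 KiB request covers only one.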

TRIM behavior risk in large or virtualized deployments

Many Linux distributions ship with fstrim.timer enabled by default. Other distributions might ship with it disabled but periodic trimming is almost universally recommended. When enabled, the systemd timer runs fstrim on all mounted file systems, typically weekly, often on a Sunday.

In a large virtualized environment, there are two distinct I/O risks when fstrim runs concurrently across many VMs.

Discard I/O volume: A periodic fstrim run across many VMs generates a concurrent burst of discard requests against the shared storage back end. Each VM might submit hundreds of MiB or GiB of discard ranges. Even when every layer passes discards through correctly, this can saturate storage nodes, increasing latency, causing I/O queue buildup, or causing unavailability.
Write conversion: If any layer converts discard requests to write I/O, the concurrent I/O burst becomes more severe and, on thin-provisioned storage, the operation consumes allocated space rather than freeing it. At the DRBD layer, this can occur when the backing device reports no discard support or when discard_zeroes_if_aligned is set to no.⁶

Auditing and Managing TRIM

This section describes operational procedures. It assumes familiarity with the concepts described in the preceding section.

Auditing TRIM pass-through in your storage stack

Because fstrim -v output reflects only what was submitted at the file system layer, you need lower-level tools to determine whether discards are passing through each layer or being converted to writes.

Verifying discard statistics at the block device level

Enter the following command to check the discard statistics of a block device:

cat /sys/block/test/stat

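On kernels that track discard statistics, fields 12–14 of this file are the discard I/O count, discard merges, and discard sectors. Fields 9–11 are the in-flight count, I/O ticks, and time in queue, not discard counters. Compare the discard values before and after fstrim to see whether discards actually reached the device.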

❗IMPORTANT: In the past, the field positions in /sys/block/*/stat have changed across kernel versions. Verify the positions against Documentation/block/stat.rst in your kernel source tree before referencing specific field positions.
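A before-and-after comparison can be scripted. The following sketch assumes the 17-field layout documented in Documentation/block/stat.rst for recent kernels, where fields 12 through 14 are discard I/Os, discard merges, and discard sectors. The function names and the sdX device name are hypothetical.

```shell
# Print the discard counters (I/Os, merges, sectors) from one
# /sys/block/<device>/stat line read on stdin.
# Assumes fields 12-14 hold the discard counters; verify for your kernel.
discard_counters() {
    awk '{ print $12, $13, $14 }'
}

# Print the change in discard I/Os and discard sectors between two snapshots.
# $1 = "ios merges sectors" before, $2 = the same after.
discard_delta() {
    # Word splitting is intentional: each argument holds three numbers.
    set -- $1 $2
    echo "discard_ios=$(( $4 - $1 )) discard_sectors=$(( $6 - $3 ))"
}

# Usage, as root:
#   before=$(discard_counters < /sys/block/sdX/stat)
#   fstrim -v /mnt/data
#   after=$(discard_counters < /sys/block/sdX/stat)
#   discard_delta "$before" "$after"
```

If the delta is zero although fstrim reported a nonzero byte count, a layer above that device dropped or converted the requests.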

Tracing block layer events

The blktrace command captures actual block layer events including discard requests. Verify the actual behavior of your storage stack by tracing the DRBD backing device (the logical volume such as an LVM logical volume or ZFS volume, not the DRBD device) while running fstrim on the file system above DRBD.

Identify the backing device from the resource configuration:

drbdadm dump <resource> | grep disk

Trace the backing device during a TRIM:

blktrace -d /dev/<backing-vol> -D /tmp -o drbd-trace &
BTPID=$!
fstrim -v /mnt/data
kill $BTPID
wait $BTPID 2>/dev/null

Show the completed I/O types:

blkparse /tmp/drbd-trace | awk '$6 == "C" { print $7 }' | sort | uniq -c

In the blkparse output, $6 is the action field and $7 is the RWBS field. C selects completed operations.

❗IMPORTANT: Verify field positions for your installed version with man blkparse before adapting this command.

Interpret the counts as follows:

A large number of W or WS entries, proportional to the trimmed range, signifies that DRBD is zero-filling. The discard was converted to sequential writes across the entire range.
D entries signify that discards passed through DRBD and reached the backing logical volume as discard requests.
A small count of WSM entries and nothing else signifies that discards passed through DRBD, but the LVM layer absorbed them without issuing block I/O. The WSM entries (Write + Sync + Meta) are DRBD activity log updates, not data writes. A thick logical volume (dm-linear) does not forward discards to the underlying physical volume. A blktrace on it will not show D events for those requests.
No entries at all signify that blktrace was not running during the trim, or that the device is not receiving requests from the layer above.
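The interpretation above can be condensed into a single summary line. The following sketch assumes the default blkparse output format, with the action in field 6 and RWBS in field 7. It groups completed requests into discards, data writes, and metadata writes. The summarize_completions function name is hypothetical.

```shell
# Summarize completed requests from default-format blkparse output on stdin.
# RWBS values starting with D count as discards, W values containing M as
# metadata writes, and the remaining W values as data writes.
summarize_completions() {
    awk '$6 == "C" {
             if ($7 ~ /^D/)        discards++
             else if ($7 ~ /^W.*M/) meta_writes++
             else if ($7 ~ /^W/)   writes++
             else                  other++
         }
         END { printf "discards=%d writes=%d meta_writes=%d other=%d\n",
                      discards, writes, meta_writes, other }'
}

# Usage:
#   blkparse /tmp/drbd-trace | summarize_completions
```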

Verifying DRBD discard configuration

Enter the following command to show the running configuration of your DRBD resource, including options that were not explicitly set:

drbdsetup show --show-defaults <resource>

Look for discard-zeroes-if-aligned in the output.
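To extract only that value, for example in an audit script, you could filter the output. This sketch assumes that the option appears on its own line in the form "discard-zeroes-if-aligned <value>;", which you should confirm against the output of your DRBD version. The discard_zeroes_setting function name is hypothetical.

```shell
# Print the value of discard-zeroes-if-aligned from "drbdsetup show" output
# read on stdin. Assumes a line of the form:
#   discard-zeroes-if-aligned <value>; [optional comment]
discard_zeroes_setting() {
    awk '$1 == "discard-zeroes-if-aligned" { sub(/;.*/, "", $2); print $2 }'
}

# Usage, as root:
#   drbdsetup show --show-defaults <resource> | discard_zeroes_setting
```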

Re-trimming a live EXT4 file system

When a file system is in active use and cannot be unmounted, you can use the fallocate and sync approach to clear a stale trim state.

Check available free space before allocating:

df /mnt/test

Allocate approximately 90 percent of the reported available space. Avoid allocating the full available space, because this risks causing a “no space left on device” error for active users of the file system. Remove the temporary file, force a journal commit, and trim.

fallocate -l <~90-percent-of-available-space> /mnt/test/scratch
rm /mnt/test/scratch
sync
fstrim -v /mnt/test

This approach relies on fallocate allocating blocks immediately, so the rm frees real blocks and the following sync commits that deallocation. Consuming the space with a normal file write instead, for example by using dd, would require committing the write before removing the file, because EXT4 delayed allocation does not allocate blocks until writeback and a file removed before then frees nothing.
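The 90 percent figure can be computed rather than estimated by hand. The following sketch assumes GNU df, which provides the --output=avail and -B1 options. Both function names are hypothetical.

```shell
# Print 90 percent of a byte count, using integer arithmetic.
ninety_percent() {
    echo $(( $1 * 9 / 10 ))
}

# Print the available bytes on the file system that contains path $1.
# Assumes GNU df (--output and -B1 are not available in every df).
avail_bytes() {
    df --output=avail -B1 "$1" | tail -n 1 | tr -d ' '
}

# Usage, as root:
#   size=$(ninety_percent "$(avail_bytes /mnt/test)")
#   fallocate -l "$size" /mnt/test/scratch
```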

Mitigating discard I/O risk

To reduce the risk of I/O saturation or write conversion from periodic fstrim, you have a few options.

Option 1: Verifying discard support end-to-end before enabling `fstrim`

Before relying on periodic fstrim, verify that every layer in your storage stack passes discards through to the underlying block device. Use lsblk --discard at each device level and blktrace to verify, as described earlier.

Option 2: Disabling the systemd `fstrim` timer

If your storage stack cannot support periodic TRIM operations that might run at the same time across a large deployment, disable the systemd fstrim.timer unit. The --now option also stops the timer if it is already active:

systemctl disable --now fstrim.timer

Run fstrim manually and selectively on file systems where end-to-end discard support is confirmed.

Option 3: Setting a minimum discard size to reduce the number of TRIM requests

If discard is supported but the back end has a coarse granularity, for example, an LVM chunk size of 2 MiB, tell fstrim not to bother with smaller extents by specifying a minimum size value:

fstrim -v --minimum 2m /mount/point

This reduces the number of discard requests and avoids requests that would be dropped without error at the volume manager layer anyway.

Option 4: Staggering the `fstrim` timer across VMs

Even when discards pass through end-to-end, a concurrent burst of discard I/O across many VMs can saturate the storage back end. Staggering periodic trimming events across your deployment reduces the peak I/O load. One way you might do this is by using the systemd timer setting RandomizedDelaySec to spread the load.

# /etc/systemd/system/fstrim.timer.d/randomize.conf
[Timer]
RandomizedDelaySec=6h
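Creating the drop-in can be scripted, for example in a VM provisioning step. The following is a minimal sketch: the write_fstrim_dropin function name is hypothetical, and the 6h delay simply mirrors the example above.

```shell
# Write a RandomizedDelaySec drop-in for fstrim.timer.
# $1 = drop-in directory, $2 = delay value (for example, 6h).
write_fstrim_dropin() {
    mkdir -p "$1" &&
    printf '[Timer]\nRandomizedDelaySec=%s\n' "$2" > "$1/randomize.conf"
}

# Usage, as root:
#   write_fstrim_dropin /etc/systemd/system/fstrim.timer.d 6h
#   systemctl daemon-reload
#   systemctl list-timers fstrim.timer
```

The daemon-reload step makes systemd read the new drop-in, and list-timers shows the next scheduled run.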

Reproducing the behavior

This section contains step-by-step experiments for verifying the behaviors described in the first section. These are not operational procedures for production environments. They require root privileges and use the null_blk kernel module, which creates memory-backed block devices with configurable discard support.

Setting up a null block device

The following commands create a memory-backed null block device with discard support by using the kernel configfs interface. Enter the following commands as the root user to emulate a 1 GiB block device with discard capability. Prefacing each command with sudo is not enough, because the shell performs the output redirections without elevated privileges.

modprobe null_blk nr_devices=0
mkdir /sys/kernel/config/nullb/test
echo 1024 > /sys/kernel/config/nullb/test/size
echo 512  > /sys/kernel/config/nullb/test/blocksize
echo 1    > /sys/kernel/config/nullb/test/memory_backed
echo 1    > /sys/kernel/config/nullb/test/discard
echo 1    > /sys/kernel/config/nullb/test/power

The nr_devices=0 module parameter suppresses automatic device creation so the device can be configured through the kernel configfs interface before it powers on. Setting memory_backed=1 is required for discard support because the null block device needs a backing store to track allocated sectors. Setting discard=1 configures the device to advertise discard support. Powering the device on then creates the block device with those parameters active.

Enter the following command to show the discard capabilities of the just-created device:

lsblk --discard /dev/test

Output will look like this:

NAME    DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
test           0      512B       4G         0

Create a mount point for the file system demonstrations:

mkdir -p /mnt/test

Demonstrating EXT4 stale trim state

Create an EXT4 file system on the null block device and mount it:

mkfs.ext4 /dev/test
mount /dev/test /mnt/test

Submit a TRIM request for the file system and use the verbose flag:

fstrim -v /mnt/test

On this first trim, the EXT4 file system has no prior trim state, so all free space is submitted for trimming:

/mnt/test: 973.1 MiB (1020416000 bytes) trimmed

Submit a second trim request:

fstrim -v /mnt/test

Output will show that this time EXT4 skips everything. It considers the free space already trimmed.

/mnt/test: 0 B (0 bytes) trimmed

Next, allocate and free some blocks and then, without synchronizing the file system, enter another trim request.⁷

fallocate -l 100m /mnt/test/scratch
rm /mnt/test/scratch
fstrim -v /mnt/test

The TRIM request should report that zero bytes were submitted for trimming. The fallocate allocation and the rm deallocation have not yet been committed to the journal, so EXT4 has not updated its per-block-group trim tracking to reflect the freed extents. A TRIM request that runs between the deallocation and the next journal commit therefore submits nothing for those extents. You do not have to force the commit with a sync command. Waiting for the next periodic journal commit has the same effect.

In this next experiment, repeat the allocating and freeing operations but add a synchronization step after the deletion to force a journal commit before trimming:

fallocate -l 100m /mnt/test/scratch
rm /mnt/test/scratch
sync
fstrim -v /mnt/test

Output from this test will show that after a synchronization, EXT4 recognizes that the affected extents need trimming again:

/mnt/test: 111.2 MiB (116559872 bytes) trimmed

📝 NOTE: The output shows that more than 100 MiB was submitted for trimming. EXT4 tracks trim state per block group.¹ When a deallocation invalidates the trim state of a group, the next fstrim re-submits all free extents in that group, not only the changed range. The 100 MiB deallocation invalidated one or more groups, causing their entire free space to be re-submitted.

You can also clear the stale state by unmounting and remounting the file system. A reboot has the same effect. However, using mount -o remount does not clear the in-memory state.

umount /mnt/test
mount /dev/test /mnt/test
fstrim -v /mnt/test

Output from the fstrim command should show that the in-memory state does not persist:

/mnt/test: 973.1 MiB (1020416000 bytes) trimmed

Demonstrating XFS TRIM behavior

Unmount the EXT4 file system and format the device with XFS:

umount /mnt/test
mkfs.xfs /dev/test
mount /dev/test /mnt/test
fstrim -v /mnt/test

Output will show all space submitted for trimming:

/mnt/test: 999 MiB (1047527424 bytes) trimmed

Re-submit a TRIM request:

fstrim -v /mnt/test

Output shows again that all space was submitted for trimming:

/mnt/test: 999 MiB (1047527424 bytes) trimmed

This example shows that XFS always submits the full free space down the stack on every fstrim call. Compared with EXT4, this makes the risk of a zero-fill write storm higher but also more predictable.

Demonstrating LVM discard granularity enforcement

The following command sequence uses a second null block device with an LVM thin pool on top:

modprobe null_blk nr_devices=0 # if you did not load the module earlier
mkdir /sys/kernel/config/nullb/lvmtest
echo 1024 > /sys/kernel/config/nullb/lvmtest/size
echo 512  > /sys/kernel/config/nullb/lvmtest/blocksize
echo 1    > /sys/kernel/config/nullb/lvmtest/memory_backed
echo 1    > /sys/kernel/config/nullb/lvmtest/discard
echo 1    > /sys/kernel/config/nullb/lvmtest/power

LVM does not recognize a null block device as a known device type by default. The --config argument to LVM commands overrides the accepted types list for the duration of each command. Assign it to a variable to avoid repeating it.

LVM_CFG='devices{types=["nullb",16]}'

Create a physical volume, volume group, thin pool with an explicit 64 KiB chunk size, and a thin volume:

pvcreate --config "$LVM_CFG" /dev/lvmtest
vgcreate --config "$LVM_CFG" lvmtest_vg /dev/lvmtest
lvcreate --config "$LVM_CFG" \
  --type thin-pool \
  -L 512M \
  --chunksize 64k \
  -n lvmtest_pool lvmtest_vg
lvcreate --config "$LVM_CFG" \
  -V 2G \
  --thin lvmtest_vg/lvmtest_pool \
  -n lvmtest_vol

Show the logical volume type and chunk size:

lvs --config "$LVM_CFG" -o lv_name,lv_attr,chunk_size lvmtest_vg

Output will show the following:

  LV           Attr       Chunk
  lvmtest_pool twi-aotz-- 64.00k
  lvmtest_vol  Vwi-a-tz--     0

The t at position 1 of lv_attr identifies the thin pool and the V identifies the thin volume. The thin pool carries the 64 KiB chunk size. chunk_size is 0 for the thin volume because the constraint belongs to the pool, not the volume.

Verify that the chunk size propagates as DISC-GRAN to the device nodes visible to layers above:

lsblk --discard /dev/lvmtest

Output will show the following tree view of the lvmtest null block device:

NAME                              DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
lvmtest                                  0      512B       4G         0
├─lvmtest_vg-lvmtest_pool_tmeta          0      512B       4G         0
│ └─lvmtest_vg-lvmtest_pool-tpool        0      512B       4G         0
│   ├─lvmtest_vg-lvmtest_pool            0      512B       4G         0
│   └─lvmtest_vg-lvmtest_vol             0       64K      64M         0
└─lvmtest_vg-lvmtest_pool_tdata          0      512B       4G         0
  └─lvmtest_vg-lvmtest_pool-tpool        0      512B       4G         0
    ├─lvmtest_vg-lvmtest_pool            0      512B       4G         0
    └─lvmtest_vg-lvmtest_vol             0       64K      64M         0

A thin volume starts with no physical blocks allocated. Discarding an unallocated range does nothing at the dm-thin layer regardless of request size, because there is nothing to return to the pool. To observe granularity enforcement, write one chunk of data first to force an allocation.⁸

dd if=/dev/zero of=/dev/lvmtest_vg/lvmtest_vol bs=64k count=1 oflag=direct

📝 NOTE: This dd command is a contrived example and does not represent a realistic workload. In practice, thin pool chunks are allocated by creating a file system (mkfs), by ordinary file writes through a mounted file system, or by DRBD replication writing incoming data to the backing volume. Any of those operations would put the LVM thin pool in the same state that the upcoming command sequences demonstrate, that is, some chunks allocated and available for reclamation by a subsequent discard.

Check the thin pool data block usage:

dmsetup status lvmtest_vg-lvmtest_pool-tpool

The sixth field of dmsetup status output shows <used_data_blocks>/<total_data_blocks>. After writing 64 KiB to a pool with a 64 KiB chunk size, used_data_blocks is 1.

0 1048576 thin-pool 1 112/1024 1/8192 - rw discard_passdown queue_if_no_space - 256
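To compare the value before and after each discard without reading the whole status line, you can extract the used block count from the sixth field. The thin_pool_used_blocks function name is hypothetical.

```shell
# Print the used data block count from a thin-pool status line on stdin.
# Field 6 has the form <used_data_blocks>/<total_data_blocks>.
thin_pool_used_blocks() {
    awk '{ split($6, blocks, "/"); print blocks[1] }'
}

# Usage, as root:
#   dmsetup status lvmtest_vg-lvmtest_pool-tpool | thin_pool_used_blocks
```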

Issue a 32 KiB discard (smaller than the 64 KiB chunk size):

blkdiscard -o 0 -l $((32 * 1024)) /dev/lvmtest_vg/lvmtest_vol
echo "exit code: $?"

Output will show no errors:

exit code: 0

Enter the dmsetup status lvmtest_vg-lvmtest_pool-tpool command again. The output should be the same as earlier and show that used_data_blocks stays at 1. The kernel block layer dropped the request at the DISC-GRAN=64K boundary before it reached dm-thin. The exit code is 0.

Next, submit a 64 KiB discard (exactly one chunk) request, aligned to offset 0:

blkdiscard -o 0 -l $((64 * 1024)) /dev/lvmtest_vg/lvmtest_vol
echo "exit code: $?"

Enter another dmsetup status lvmtest_vg-lvmtest_pool-tpool command. Output will show used_data_blocks is now 0. This shows that dm-thin received the full-chunk discard and returned the block to the pool.

0 1048576 thin-pool 1 112/1024 0/8192 - rw discard_passdown queue_if_no_space - 256
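The difference between the two requests above comes down to simple arithmetic: only chunks that lie entirely inside the discarded range can be returned to the pool. The following sketch works through that arithmetic under the assumption, consistent with this demonstration, that partially covered chunks are left allocated. The function name full_chunks_in_range is hypothetical.

```shell
# Hypothetical helper: count chunks that lie entirely inside a discard range.
# Arguments: offset, length, and chunk size, all in bytes.
# The body runs in a subshell so the variables do not leak into the caller.
full_chunks_in_range() (
    offset=$1; length=$2; chunk=$3
    first=$(( (offset + chunk - 1) / chunk ))   # first chunk starting at or after offset
    end=$(( (offset + length) / chunk ))        # first chunk not fully covered
    echo $(( end > first ? end - first : 0 ))
)

full_chunks_in_range 0 $((32 * 1024)) $((64 * 1024))   # the 32 KiB request: prints 0
full_chunks_in_range 0 $((64 * 1024)) $((64 * 1024))   # the 64 KiB request: prints 1
```

By the same arithmetic, a 64 KiB request that starts at a 32 KiB offset covers no whole chunk, which is why alignment matters as much as request size.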

Cleaning up

Enter the following commands to clean up the demonstration environment:

LVM_CFG='devices{types=["nullb",16]}'
vgremove --config "$LVM_CFG" -y lvmtest_vg
pvremove --config "$LVM_CFG" /dev/lvmtest
echo 0 > /sys/kernel/config/nullb/lvmtest/power
rmdir /sys/kernel/config/nullb/lvmtest
echo 0 > /sys/kernel/config/nullb/test/power
rmdir /sys/kernel/config/nullb/test
modprobe -r null_blk # if the kernel module is not used elsewhere

Conclusion

The safest approach to trimming in a large or virtualized environment is to verify discard support end-to-end through your storage stack by using blktrace before enabling periodic fstrim. You should also treat the fstrim -v byte count as reporting a maximum amount that might be trimmed, and not a confirmation of work done. Comparing the fstrim -v output to the discard sector count in /sys/block/*/stat shows whether file system layer discard submissions produced discards on the physical device.
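The comparison against /sys/block/*/stat can be sketched as follows. This is a minimal example, assuming the stat file layout described in the kernel block layer statistics documentation, where the fourteenth field is the number of discarded sectors in 512-byte units. Older kernels that predate discard statistics do not have this field. The function names, the sdX device name, and the /mnt/point mount point are placeholders.

```shell
# Hypothetical helper: print the discard sector count from a block device
# stat line read on stdin (assumes field 14 is "discard sectors").
discard_sectors() {
    awk '{ print $14 }'
}

# Hypothetical helper: convert a before/after sector count pair to bytes.
# Arguments: sector count before, sector count after (512-byte sectors).
discard_delta_bytes() {
    echo $(( ($2 - $1) * 512 ))
}

# Example (requires root; sdX and /mnt/point are placeholders):
#   before=$(discard_sectors < /sys/block/sdX/stat)
#   fstrim -v /mnt/point
#   after=$(discard_sectors < /sys/block/sdX/stat)
#   discard_delta_bytes "$before" "$after"
```

If the byte delta on the physical device is far smaller than the figure that fstrim -v reports, some layer between the file system and the device is likely dropping or converting discard requests.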

The volume manager layer drops discard requests smaller than its advertised discard granularity. For LVM thin pools, the effective discard granularity is the pool chunk size. For ZFS thin volumes, volblocksize sets the effective discard granularity. Neither LVM nor ZFS reports dropped discard requests. The lvs -o chunk_size command shows the LVM pool chunk size, and lsblk --discard shows the discard granularity (DISC-GRAN) value that each device in the stack advertises to the layer above it.

DRBD might convert discard requests to zero-fill writes rather than passing them through. This depends on backing device discard support and the value of the DRBD disk option discard-zeroes-if-aligned. The drbdsetup show --show-defaults <resource_name> command shows the active DRBD resource configuration. Running blktrace on the DRBD backing device during an fstrim operation can help you determine whether DRBD is passing through discard requests or converting requests to zero-fill writes.
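To check the option without reading the whole configuration dump, the relevant line can be filtered out. The following is a rough sketch, assuming that drbdsetup show prints each disk option on its own line as a name followed by a value and a semicolon. The exact layout might differ between DRBD versions, and the function name is hypothetical.

```shell
# Hypothetical helper: print the value of discard-zeroes-if-aligned found in
# `drbdsetup show --show-defaults <resource_name>` output read on stdin.
# Assumes lines of the form "discard-zeroes-if-aligned yes; # default".
drbd_discard_zeroes_setting() {
    awk '$1 == "discard-zeroes-if-aligned" { sub(/;.*/, "", $2); print $2 }'
}

# Example (requires root; replace <resource_name> with a real resource):
#   drbdsetup show --show-defaults <resource_name> | drbd_discard_zeroes_setting
```

A value of yes only tells you what DRBD is permitted to do. Whether discards reach the backing device still depends on the backing device's own discard support, which is why the blktrace check remains the more direct evidence.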

At the physical device layer, a DISC-GRAN value of 0 in lsblk --discard output indicates that the device does not support discarding. Any layer that tries to pass discards through to a non-discarding device will have those requests dropped without error or converted to zero-fill writes.
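A quick way to spot such devices across a whole stack is to filter lsblk output for a zero granularity. This is a minimal sketch, assuming lsblk is asked for the NAME and DISC-GRAN columns in raw, byte-valued form. The function name devices_without_discard is hypothetical.

```shell
# Hypothetical helper: list devices whose discard granularity is 0.
# Expects "NAME DISC-GRAN" pairs on stdin, one device per line, in bytes.
devices_without_discard() {
    awk '$2 == 0 { print $1 }'
}

# Example:
#   lsblk --raw --noheadings --bytes --output NAME,DISC-GRAN | devices_without_discard
```

Any device name that the filter prints is a point in the stack where discards from the layer above cannot be passed through as discards.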

Written by MAT, 2026-06-05.

Reviewed by LE and MDK, 2026-06-08.

1. EXT4 trim state is tracked per block group in the bb_state field of struct ext4_group_info in fs/ext4/mballoc.c. This is an in-memory structure. Its state is not persisted and is released on unmount.
2. can_do_reliable_discards() in drbd_receiver.c checks two conditions: bdev_max_discard_sectors(device->ldev->backing_bdev) > 0, and dc->discard_zeroes_if_aligned. Both must hold for DRBD to pass discards through rather than zero-fill.
3. drbd_issue_peer_discard_or_zero_out() in drbd_receiver.c sets the EE_ZEROOUT flag when can_do_reliable_discards() returns false. drbd_issue_discard_or_zero_out() then calls blkdev_issue_zeroout() for the affected range.
4. The kernel block layer formerly exposed a discard_zeroes_data flag intended to indicate whether discards return zeros on readback. However, the flag was unreliable because the behavior it described was poorly specified and some devices reported it incorrectly. Linux kernel developers later removed the in-kernel flag. The discard_zeroes_data sysfs attribute was kept for compatibility but always reports 0. The outcome is that neither older nor current kernels provide a reliable per-device zeroing guarantee through block-layer flags. The default of yes (DRBD_DISCARD_ZEROES_IF_ALIGNED_DEF 1 in drbd-headers/linux/drbd_limits.h) means DRBD proceeds with the alignment-aware discard strategy regardless, zeroing unaligned partial areas itself (see drbd_receiver.c and can_do_reliable_discards()). The LVM thin pool case shows why this is important. In this case, aligned full-chunk discards effectively zero the range when skip_block_zeroing=false, but DRBD cannot determine this from kernel flags. The value yes is therefore an explicit administrator assertion rather than something DRBD can determine automatically. DRBD developers kept the default at yes to preserve the behavior that existing deployments already relied on, rather than changing it without notice.
5. rs_discard_granularity is read in drbd_sender.c only when DRBD_FF_THIN_RESYNC is in the negotiated feature set. When the backing device does not support discard, sanitize_disk_conf() in drbd_nl.c forces it to zero, disabling the feature.
6. When discard_zeroes_if_aligned is no but the backing device supports discard, DRBD calls blkdev_issue_zeroout() without the BLKDEV_ZERO_NOUNMAP flag (drbd_receiver.c), permitting the kernel to satisfy the range using UNMAP rather than writes. Whether writes or UNMAP operations reach the device depends on what zeroing mechanisms the backing device advertises. If UNMAP is used, thin-provisioned storage frees rather than consumes space.
7. fallocate is used here because it allocates blocks immediately and deterministically, so the subsequent rm frees real blocks. This matches the common case of deleting a file whose blocks are already committed, which is what triggers re-trimming in practice. Consuming the space with a fresh write instead, for example by using dd, would behave differently, because EXT4 delayed allocation does not allocate blocks until writeback. A file written and then removed before writeback frees nothing, so the write would need to be committed before the removal. The “Re-trimming a live EXT4 file system” section discusses this.
8. All-zero data is acceptable to use in this example because an LVM thin pool allocates a chunk for any write to an unallocated region, regardless of content. However, this pattern does not hold on a ZFS volume with compression enabled, as is common. ZFS stores an all-zero block as a hole (see man zfsprops) and allocates nothing. In this case, a /dev/zero source would not force the allocation. Use an incompressible input, for example, /dev/urandom, when reproducing this example dd command on a ZFS volume.