
Jim Salter
OpenZFS founding developer Matthew Ahrens opened a PR last week for one of the most sought-after features in ZFS history – RAIDz expansion. The new feature allows a ZFS user to expand the size of a single RAIDz vdev. For example, you can use it to convert a three-disk RAIDz1 into a four-, five-, or six-disk RAIDz1.
OpenZFS is a complex file system, so explaining how this feature works is necessarily going to get a little technical. If you’re a ZFS newbie, you might want to head back to our comprehensive ZFS 101 introduction first.
Expanding storage in ZFS
In addition to being a file system, ZFS is also a storage array and volume manager, meaning you can feed it a whole collection of disk devices, not just one. The heart of a ZFS storage system is the zpool, which is the most fundamental level of ZFS storage. The zpool in turn contains vdevs, and the vdevs contain actual disks within them. Writes are split into units called records or blocks, which are then distributed semi-evenly across the vdevs.
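To make that hierarchy concrete, here is a minimal sketch using a hypothetical pool named tank and illustrative /dev/sd* device names: one zpool built from two mirror vdevs, each containing two real disks.
    # Create a pool named "tank" from two mirror vdevs of two disks each (illustrative device names).
    zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
    # Show the pool, its vdevs, and the disks inside each vdev.
    zpool status tank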
A storage vdev can be one of five types: a single disk, a mirror, RAIDz1, RAIDz2, or RAIDz3. You can add more vdevs to a zpool, and you can attach more disks to a single-disk or mirror vdev. But managing storage this way requires some advance planning and budgeting, which hobbyists and home users are frequently less than enthusiastic about.
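Both of those operations look something like the following sketch; the pool and device names are hypothetical, but zpool add and zpool attach are the commands involved.
    # Add a brand-new mirror vdev to the existing pool "tank".
    zpool add tank mirror /dev/sde /dev/sdf
    # Attach a third disk to an existing mirror by naming one of its current members,
    # turning the two-way mirror into a three-way mirror.
    zpool attach tank /dev/sda /dev/sdg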
Conventional RAID, which has no concept equivalent to ZFS’s “pool,” generally offers the ability to expand and/or reshape an existing array. For example, you can add a single disk to a six-disk RAID6 array, turning it into a seven-disk RAID6 array. Going through a live reshape can be fairly painful, though, especially on nearly full arrays; it’s entirely possible for such a task to take a week or more, with the array’s performance limited to a quarter or less of normal the entire time.
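With Linux’s mdadm, for example, that kind of reshape looks roughly like this (device names are illustrative):
    # Add a seventh disk to a six-disk md RAID6 array, then reshape the array onto it.
    mdadm --add /dev/md0 /dev/sdh
    mdadm --grow /dev/md0 --raid-devices=7
    # The reshape runs in the background; progress can be watched in /proc/mdstat.
    cat /proc/mdstat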
Historically, ZFS has shunned this kind of expansion. ZFS was originally developed for business use, and live array reshaping is generally a non-starter in the business world. Letting your storage degrade to unusable performance levels for days on end typically costs more in labor and overhead than simply buying a whole new set of hardware. Live expansion is also potentially dangerous, since it involves reading and rewriting all the data and puts the array in a temporary and much less well-tested “half this, half that” condition until it completes.
For users with many drives, the new RAIDz expansion probably won’t materially change how they use ZFS. It will still be both easier and more practical to manage vdevs as complete units rather than to muck around inside them. But hobbyists, home users, and small shops running ZFS on a single vdev will likely get a lot of mileage out of the new feature.
How does it work?

Practically speaking, Ahrens’ new vdev expansion capability adds new functionality to an existing command, zpool attach, which is normally used to add a disk to a single-disk vdev (converting it into a mirror vdev) or to add an extra disk to a mirror (for example, turning a two-disk mirror into a three-disk mirror).
With the new code, you can attach new disks to an existing RAIDz vdev as well. Doing so expands the vdev in width but does not change its type, so you can turn a six-disk RAIDz2 vdev into a seven-disk RAIDz2 vdev, but you cannot turn it into a seven-disk RAIDz3.
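Based on the PR, the expansion is invoked with the same zpool attach command, pointed at a RAIDz vdev rather than a single disk; the exact syntax may change before release, and the names below are illustrative.
    # Attach a seventh disk to the existing six-disk RAIDz2 vdev
    # (here called raidz2-0, as zpool status would name it).
    zpool attach tank raidz2-0 /dev/sdg
    # Check the vdev layout and capacity afterward.
    zpool status tank
    zpool list tank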
When you issue your zpool attach command, the expansion begins. During expansion, each block or record in the vdev being expanded is read and then rewritten. The sectors of the rewritten block are distributed across all the disks in the vdev, including the new disk(s), but the width of the stripe itself does not change. So a RAIDz2 vdev expanded from six disks to ten will still be full of six-wide stripes once the expansion completes.
So while the user will see the extra space made available by the new drives, the storage efficiency of the expanded data does not improve thanks to the new drives. In the example above, we went from a six-disk RAIDz2 with a nominal storage efficiency of 67 percent (four of every six sectors are data) to a ten-disk RAIDz2. Data newly written to the ten-disk RAIDz2 has a nominal storage efficiency of 80 percent, with eight of every ten sectors being data, but the old, expanded data is still laid out in six-wide stripes, so it keeps the old storage efficiency of 67 percent.
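The nominal-efficiency arithmetic is simple enough to check at a shell prompt (data sectors divided by total sectors per stripe):
    echo "scale=2; 4/6" | bc    # six-wide RAIDz2: 4 data + 2 parity  -> .66, about 67 percent
    echo "scale=2; 8/10" | bc   # ten-wide RAIDz2: 8 data + 2 parity  -> .80, or 80 percent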
It’s worth noting that this isn’t an unexpected or bizarre state for a vdev to be in; RAIDz already uses dynamic, variable stripe widths to accommodate blocks or records too small to stripe across all the disks in a vdev.
For example, if you write a single metadata block (the data that holds a file’s name, permissions, and location on disk), it fits within a single sector on disk. If you write that metadata block to a ten-wide RAIDz2, you don’t write a full ten-wide stripe; instead, you write an undersized block only three disks wide, consisting of a single data sector plus two parity sectors. So the “undersized” blocks in a newly expanded RAIDz vdev are nothing for ZFS to get flustered about. They’re just another day at the office.
Is there a lasting impact on performance?
As we discussed above, a newly expanded RAIDz vdev won’t quite look like one designed that way from “birth,” at least not at first. Although there are more disks in the mix, the internal structure of the data does not change.
Adding one or more new disks to the vdev means it should be capable of somewhat higher throughput. Even though the legacy blocks don’t span the full width of the vdev, the added disks mean more spindles to spread the work across. This probably won’t deliver a jaw-dropping speed increase, though: six-wide stripes on a seven-disk vdev still mean you can’t read or write two blocks simultaneously, so any speed improvements are likely to be modest.
The net impact on performance can be difficult to predict. If you expand from a six-disk RAIDz2 to a seven-disk RAIDz2, for example, your original six-disk configuration needed no padding: a 128KiB block splits evenly into four 32KiB data pieces plus two 32KiB parity pieces. The same record split among seven disks does need padding, because 128KiB divided into five data pieces does not work out to an even number of sectors.
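The padding arithmetic works out like this, assuming 4KiB sectors:
    echo "128/4" | bc           # six-wide RAIDz2, four data disks: 32KiB per disk, an even split
    echo "scale=1; 128/5" | bc  # seven-wide RAIDz2, five data disks: 25.6KiB per disk,
                                # not a whole number of 4KiB sectors, so the stripe needs padding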
Likewise, in some cases, especially with a small recordsize or volblocksize set, the workload per individual disk can be significantly lighter in the older, narrower layout than in the newer, wider one. A 128KiB block split into 32KiB pieces for a six-wide RAIDz2 can be read or written more efficiently per disk than one split into 16KiB pieces for a ten-wide RAIDz2, for example, so it’s something of a toss-up whether more disks with smaller pieces will deliver more throughput than fewer disks with bigger pieces.
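For reference, recordsize and volblocksize are real ZFS properties; the dataset and zvol names in this sketch are hypothetical.
    # Larger records split into larger per-disk pieces on a wide RAIDz vdev.
    zfs set recordsize=1M tank/media
    # volblocksize is fixed when a zvol is created and can only be inspected afterward.
    zfs get volblocksize tank/vm-disk0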
The one thing you can be sure of is that the newly expanded configuration will generally perform as well as the original, un-expanded one did, and that once the majority of the data has been (re)written at the new width, the expanded vdev will not perform any differently, or be any less reliable, than one designed that way from the start.
Why not reshape records/blocks during expansion?
It may seem strange that the initial expansion process doesn’t rewrite all the existing blocks to the new width while it runs; it’s reading and rewriting the data anyway, right? We asked Ahrens why the original width is left as-is, and the answer boils down to “it’s easier and safer that way.”
One important factor to recognize is that, technically, the expansion isn’t moving blocks; it’s just moving sectors. As written, the expansion code doesn’t need to know where ZFS’s logical block boundaries are; the expansion routine has no idea whether an individual sector is parity or data, let alone which block it belongs to.
The expansion could traverse all the block pointers to locate block boundaries, and then it would know which sector belongs to which block and how to reshape each block, but according to Ahrens, that would be extremely invasive to ZFS’s on-disk format. The expansion would need to continuously update spacemaps on metaslabs to account for changes in the on-disk size of each block, and, if the block belongs to a dataset rather than a zvol, update the per-dataset and per-file space accounting as well.
If it really bothers you to know that you have four-wide stripes on a now five-wide vdev, you can simply read and rewrite your data yourself after the expansion completes. The easiest way to do this is to use zfs snapshot, zfs send, and zfs receive to replicate entire datasets and zvols. If you aren’t worried about ZFS properties, a simple mv operation will do the trick.
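Here is a sketch of the replication approach, with hypothetical dataset names; copying the data this way rewrites every block at the vdev’s new width.
    # Snapshot the dataset, replicate it, then swap the copy into place.
    zfs snapshot -r tank/data@rewrite
    zfs send -R tank/data@rewrite | zfs receive tank/data-new
    # After verifying the copy, retire the old dataset and rename the new one.
    zfs destroy -r tank/data
    zfs rename tank/data-new tank/data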
In most cases, though, we recommend that you just relax and let ZFS do its thing. Your undersized blocks of older data aren’t really hurting anything, and since you will naturally delete and/or modify data over the lifetime of the vdev, most of them will get rewritten as needed anyway, without administrator intervention or long periods of heavy storage load from obsessively reading and rewriting everything at once.
When will RAIDz expansion reach production?
Ahrens’ new code is not yet part of any OpenZFS release, let alone added to anyone else’s repositories. We asked Ahrens when we can expect the code to reach production, and unfortunately, it’s still a ways off.
It’s too late to include RAIDz expansion in the upcoming OpenZFS 2.1 release, which is expected very soon (2.1 release candidate 7 is available now). It should land in the next major OpenZFS release; it’s too early for concrete dates, but major releases typically happen about once a year.
In general, we expect RAIDz expansion to hit production status in distributions such as Ubuntu and FreeBSD sometime around August 2022, but that’s just a guess. TrueNAS may well put it into production sooner, since iXsystems tends to pull ZFS features off master before they officially hit release status.
Matt Ahrens presented RAIDz expansion at the FreeBSD Developer Summit; his talk begins 1 hour and 41 minutes into this video.