ZFS fans, rejoice – RAIDz expansion will be a reality very soon

OpenZFS supports many complex disk topologies, but "spiral stack sitting on a desk" is still not one of them.

Jim Salter

OpenZFS founding developer Matthew Ahrens opened a PR last week for one of the most sought-after features in ZFS history – RAIDz expansion. The new feature allows a ZFS user to expand the size of a single RAIDz vdev. For example, you can use the new feature to convert a three-disk RAIDz1 into a four-, five-, or six-disk RAIDz1.

OpenZFS is a complex filesystem, so explaining how this feature works is necessarily going to get a little gnarly. If you're a ZFS newbie, you may want to head back to our comprehensive ZFS 101 introduction first.

Expanding storage in ZFS

In addition to being a filesystem, ZFS is also a storage array and volume manager, which means you can feed it a whole collection of disk devices, not just one. The heart of a ZFS storage system is the zpool – the most basic level of ZFS storage. The zpool in turn contains vdevs, and the vdevs contain the actual disks. Writes are broken into units called records or blocks, which are then distributed semi-evenly across the vdevs.
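To make that hierarchy concrete, here's a minimal sketch using a hypothetical pool named tank and Linux-style device names: the pool gets a single six-disk RAIDz2 vdev, and the six disks live inside that vdev.

    # create a pool containing one six-disk RAIDz2 vdev (hypothetical disk names)
    zpool create tank raidz2 sda sdb sdc sdd sde sdf

    # display the pool -> vdev -> disk hierarchy
    zpool status tank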

A storage vdev can be one of five types: single disk, mirror, RAIDz1, RAIDz2, or RAIDz3. You can add more vdevs to a zpool, and you can attach more disks to a single-disk or mirror vdev. But managing storage this way requires some planning and budgeting, which hobbyists and homelabbers are often less than thrilled about.
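Growing a pool the traditional way, for example, means adding an entire new vdev alongside the existing one – sticking with the hypothetical tank pool from above:

    # add a second six-disk RAIDz2 vdev; future writes are distributed across both vdevs
    zpool add tank raidz2 sdg sdh sdi sdj sdk sdl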

Conventional RAID, which does not share the "pool" concept with ZFS, generally does offer the ability to expand and/or reshape an existing array. For example, you can add a single disk to a six-disk RAID6 array, turning it into a seven-disk RAID6 array. Living through a reshape can be quite painful, especially on nearly full arrays; it's entirely possible for such a task to take a week or more, with the array's performance limited to a quarter or less of normal the whole time.

Historically, ZFS has shunned this kind of expansion. ZFS was originally developed for business use, and reshaping a live array is generally a non-starter in the business world. Letting your storage's performance degrade to unusable levels for days on end generally costs more in labor and overhead than simply buying a whole new set of hardware. Live expansion is also potentially dangerous, since it involves reading and rewriting all of the data and leaves the array in a temporary, much less well-tested "half this, half that" state until it completes.

For users with many drives, the new RAIDz expansion probably won't materially change how they use ZFS. It will still be both easier and more practical to manage vdevs as complete units rather than fiddling with the disks inside them. But hobbyists, homelabbers, and small-scale users who run ZFS with a single vdev will probably get a lot of use out of the new feature.

How does it work?

In this slide we see a four-disk RAIDz1 (left) expanded to a five-disk RAIDz1 (right). Note that the data is still written in stripes four disks wide!

Practically speaking, Ahrens' new vdev expansion feature simply adds new capability to an existing command, zpool attach, which is normally used to add a disk to a single-disk vdev (turning it into a mirror vdev) or to add an extra disk to a mirror (for example, turning a two-disk mirror into a three-disk mirror).
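A quick sketch of that existing behavior, again assuming a hypothetical pool named tank whose single-disk vdev is sdb:

    # attach sdc to the single-disk vdev sdb, turning it into a two-disk mirror
    zpool attach tank sdb sdc

    # attach one more disk to the same mirror, making it a three-disk mirror
    zpool attach tank sdb sdd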

With the new code, you can attach new disks to an existing RAIDz vdev as well. Doing so expands the vdev in width but does not change the vdev type, so you can turn a six-disk RAIDz2 vdev into a seven-disk RAIDz2 vdev, but you cannot turn it into a seven-disk RAIDz3.
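Under the new code, the same command targets the RAIDz vdev itself rather than one of its member disks. The exact invocation may still change before release, but based on the PR it looks something like this (hypothetical pool, vdev, and disk names):

    # expand the six-disk RAIDz2 vdev named raidz2-0 by attaching a seventh disk
    zpool attach tank raidz2-0 sdg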

When you issue the zpool attach command, the expansion begins. During expansion, each block or record on the vdev being expanded is read and then rewritten. The sectors of the rewritten block are distributed across all of the disks in the vdev, including the new disk(s), but the width of the stripe itself does not change. So a RAIDz2 vdev expanded from six disks to ten will still be full of six-wide stripes after the expansion completes.
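The pool remains online and usable while the expansion runs; you can keep an eye on things with the usual tooling, though exactly how expansion progress is surfaced may differ in the final release:

    # check pool health and any in-progress activity, plus per-vdev capacity
    zpool status tank
    zpool list -v tank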

So while the user will see the extra space made available by the new disks, the storage efficiency of the expanded (old) data does not improve. In the example above, we expanded a six-disk RAIDz2 with a nominal storage efficiency of 67 percent (four of every six sectors are data) to a ten-disk RAIDz2. Data newly written to the ten-disk RAIDz2 has a nominal storage efficiency of 80 percent – eight of every ten sectors are data – but the old, expanded data is still laid out in six-wide stripes, so it keeps the old storage efficiency of 67 percent.

It's worth noting that this isn't an unexpected or bizarre state for a vdev to find itself in – RAIDz already uses dynamic, variable stripe widths to accommodate blocks or records too small to stripe across all the disks in a vdev.

For example, if you write a single metadata block (the data describing a file's name, permissions, and location on disk), it fits within a single sector. If you write that metadata block to a ten-wide RAIDz2, you don't write a full ten-wide stripe – instead, you write an undersized block only three disks wide: a single data sector plus two parity sectors. So the "undersized" blocks in a newly expanded RAIDz vdev are nothing for ZFS to trip over. They're just another day at the office.

Is there a lasting impact on performance?

As we discussed above, a newly expanded RAIDz vdev won't look quite like one designed that way from "birth" – at least not at first. Although there are more disks in the mix, the internal structure of the old data does not change.

Adding one or more new disks to the vdev means it should be capable of somewhat higher throughput. Although the legacy blocks don't span the full width of the vdev, the added disks mean more spindles to spread the work across. This probably won't provide a stunning increase in speed, though: six-wide stripes on a seven-disk vdev still mostly prevent you from reading or writing two blocks simultaneously, so any speed improvements are likely to be minor.

The net impact on performance can be difficult to predict. If you're expanding from a six-disk RAIDz2 to a seven-disk RAIDz2, for example, your original six-disk layout needed no padding: a 128KiB block slices evenly into four 32KiB data pieces plus two 32KiB parity pieces. The same record divided among seven disks does need padding, because 128KiB split across five data pieces doesn't work out to a whole number of sectors per disk.

Likewise, in some cases – especially with a small recordsize or volblocksize set – the per-disk workload can be significantly less demanding in the older, narrower layout than in the newer, wider one. A 128KiB block split into 32KiB pieces for a six-wide RAIDz2 can be read or written more efficiently per disk than one split into 16KiB pieces for a ten-wide RAIDz2, for example – so it's something of a toss-up whether more disks with smaller pieces will deliver more throughput than fewer disks with bigger pieces.

The only thing you can reasonably be sure of is that the newly expanded configuration should generally perform about as well as the original, non-expanded version – and that once the majority of the data has been (re)written at the new width, the expanded vdev will neither perform differently from, nor be less reliable than, one designed that way from the start.

Why not reshape records/blocks during expansion?

It may seem strange that the initial expansion process doesn't rewrite all existing blocks to the new width while it's running – it's reading and rewriting the data anyway, right? We asked Ahrens why the original width is left as-is, and the answer boils down to "it's easier and safer that way."

One key factor to recognize is that, technically, the expansion isn't moving blocks; it's just moving sectors. As written, the expansion code doesn't need to know where ZFS's logical block boundaries lie – the expansion routine has no idea whether an individual sector is parity or data, let alone which block it belongs to.

The expansion could traverse all of the block pointers to locate block boundaries – then it would know which sector belongs to which block and how to reshape each block – but according to Ahrens, doing it that way would be extremely invasive to ZFS's on-disk format. The expansion would need to continually update spacemaps on metaslabs to account for the change in on-disk size of each block – and, if the block belongs to a dataset rather than a zvol, update the per-dataset and per-file space accounting as well.

If it really bothers you to know you have four-wide stripes on a now five-wide vdev, you can simply read and rewrite your data yourself after the expansion completes. The easiest way to do this is to use zfs snapshot, zfs send, and zfs receive to replicate entire datasets and zvols. If you're not worried about ZFS properties, a simple mv operation will do the trick.
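A minimal sketch of the snapshot/send/receive approach, assuming a hypothetical dataset named tank/data and enough free space to hold a second copy while it replicates – the received copy is written entirely at the new width, after which you can drop the original and rename:

    # replicate the dataset, then swap the new copy into place
    zfs snapshot -r tank/data@rewrite
    zfs send -R tank/data@rewrite | zfs receive tank/data-new
    zfs destroy -r tank/data
    zfs rename tank/data-new tank/data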

In most cases, though, we recommend you just relax and let ZFS do its thing. Your undersized blocks of older data aren't really hurting anything, and as you delete and/or modify data over the lifetime of the vdev, most of them will get rewritten naturally as needed – without administrator intervention, and without long periods of heavy storage load from obsessively reading and rewriting everything at once.

When will RAIDz expansion hit production?

Ahrens’ new code is not yet part of an OpenZFS release, let alone added to anyone else’s repositories. We’ve asked Ahrens when we can expect the code to be in production, and unfortunately it’s still a while away.

It's too late to include RAIDz expansion in the upcoming OpenZFS 2.1 release, which is expected very soon (2.1 release candidate 7 is available now). It should be included in the next major OpenZFS release; it's too early for concrete dates, but major releases typically happen about once a year.

Broadly speaking, we expect RAIDz expansion to reach production in the likes of Ubuntu and FreeBSD sometime around August 2022, but that's just a guess. TrueNAS may well get it into production sooner, as ixSystems tends to pull ZFS features from master before they officially hit release status.

Matt Ahrens presented RAIDz expansion at the FreeBSD Developer Summit – his talk starts at 1 hour 41 minutes into this video.
