A ZFS developer’s analysis of the good and bad in Apple’s new APFS file system
By Adam Leventhal for the Ars Technica
Apple announced a new file system that will make its way into all of its OS variants (macOS, tvOS, iOS,watchOS) in the coming years. Media coverage to this point has been mostly breathless elongations ofApple’s developer documentation. With a dearth of detail I decided to attend the presentation and Q&A with the APFS team at WWDC. Dominic Giampaolo and Eric Tamura, two members of the APFS team,gave an overview to a packed room; along with other members of the team, they patiently answered questions later in the day. With those data points and some first-hand usage I wanted to provide an overview and analysis both as a user of Apple-ecosystem products and as a long-time operating system and file system developer.
The overview is divided into several sections. I’d encourage you to jump around to topics of interest or skip right to the conclusion (or to the tweet summary). Highest praise goes to encryption; ire to data integrity.
APFS, the Apple File System, was itself started in 2014 with Giampaolo as its lead engineer. It’s a stand-alone, from-scratch implementation (an earlier version of this post noted a dependency on Core Storage, but Giampaolo set me straight in this comment). I asked him about looking for inspiration in other modern file systems such as BSD’s HAMMER, Linux’s btrfs, or OpenZFS (Solaris, illumos,FreeBSD, Mac OS X, Ubuntu Linux, etc.), all of which have features similar to what APFS intends to deliver. (And note that Apple built a fairly complete port of ZFS, though Giampaolo was not apparently part of the group advocating for it.) Giampaolo explained that he was aware of them as a self-described file system guy (he built the file system in BeOS, unfairly relegated to obscurity when Apple opted to purchase NeXTSTEP instead), but didn’t delve too deeply for fear, he said, of tainting himself.
Giampaolo praised the APFS testing team as being exemplary. This is absolutely critical. A common adage is that it takes a decade to mature a file system, and my experience with ZFS more or less confirms this. Apple will be delivering APFS broadly with 3-4 years of development so will need to accelerate quickly to maturity.
Paying down debt
HFS was introduced in 1985 when the Mac 512K (of memory! Holy smokes!) was Apple’s flagship. HFS+, a significant iteration, shipped in 1998 on the G3 PowerMacs with 4GB hard drives. The typical storage capacity of a home computer has increased by a factor of over 1,000 since 1998 (and let’s not even talk about 1985).. HFS+ has been pulled in a bunch of competing directions with different forks for different devices (e.g. I’ve been told by inside sources that the iOS team created their own HFS variant, working so covertly that not even the Mac OS team knew) and different features (e.g. journaling, case sensitivity). It’s old. It’s a mess. And, critically, it’s missing a bunch of features that are really considered basic costs of doing business for most operating systems. Wikipedia lists nanosecond timestamps, checksums, snapshots, and sparse file support among those missing features. Add to that the obvious gap of large device support and you’ve got a big chunk of the APFS feature list.
APFS first and foremost pays down the unsustainable technical debt that Apple has been carrying in HFS+. (In 2001 ZFS grew from a similar need where UFS had been evolved since 1977.) It unifies the multifarious forks. It introduces the expected features. In general it first brings the derelict building up to code.
Compression is an obvious common feature that’s missing in the APFS feature list. It’s conceptually quite easy, I told the development team (we had it in ZFS from the outset), so why not include it? To appeal to Giampaolo’s BeOS nostalgia I even recalled my job interview with Be in 2000 when they talked about how compression actually improved overall performance since data I/O is far more expensive than computation (obvious now, but novel then). The Apple folks agreed, and—in typical Apple fashion—neither confirmed nor denied while strongly implying that it’s definitely a feature we can expect in APFS. I’ll be surprised if compression isn’t included in its public launch.
Encryption is clearly a core feature of APFS. This comes from diverse requirements from the various devices; for example, multiple keys within file systems on the iPhone, or per-user keys on laptops. I heard the term “innovative” quite a bit at WWDC, but here the term is aptly applied to APFS. It supports several different encryption choices for a file system:
- Single-key for metadata and user data
- Multi-key with different choices for metadata, files, and even sections of a file (“extents”)
Multi-key encryption is particularly relevant for portables where all data might be encrypted, but unlocking your phone provides access to an additional key and therefore additional data. Unfortunately this doesn’t seem to be working in the first beta of macOS Sierra (specifying fileEncryption when creating a new volume with diskutil results in a file system that reports “Is Encrypted” as “No”).
Related to encryption, I noticed an undocumented feature while playing around with diskutil (which prompts you for interactive confirmation of the destructive power of APFS unless this is added to the command-line: -IHaveBeenWarnedThatAPFSIsPreReleaseAndThatIMayLoseData; I’m not making this up). APFS (apparently) supports the ability to securely and instantaneously erase a file system with the “effaceable” option when creating a new volume in diskutil. This presumably builds a secret key that cannot be extracted from APFS and encrypts the file system with it. A secure erase then need only delete the key rather than needing to scramble and re-scramble the full disk to ensure total eradication. Various iOS docs refer to this capability requiring some specialized hardware; it will be interesting to see what the option means on macOS. Either way, let’s not mention this to the FBI or NSA, agreed?
Snapshots and backup
APFS brings a much-desired file system feature: snapshots. A snapshot lets you freeze the state of a file system at a particular moment and continue to use and modify that file system while preserving the old data. It does so in a space-efficient fashion where, effectively, changes are tracked and only new data takes up additional space. This has the potential to be extremely valuable for backup by efficiently tracking the data that has changed since the last backup.
ZFS includes snapshots and serialization mechanisms that make it efficient to back up file systems or transfer file systems to a remote location. Will APFS work like that? Probably not, answered Giampaolo. ZFS sends all changed data, while Time Machine can have exclusion lists and the like. That seems surmountable, but we’ll see what Apple does. APFS right now is incompatible with Time Machine due to the lack of directory hard links, a fairly disgusting implementation that likely contributes to Time Machine’s questionable reliability. Hopefully APFS will create some efficient serialization for Time Machine backup.
While APFS dev manager Eric Tamura demonstrated snapshots at WWDC, the required utilities aren’t included in the macOS Sierra beta. I used DTrace (technology I’m increasingly amazed that Apple ported from OpenSolaris) to find a tantalizingly named new system call fs_snapshot; I’ll leave it to others to reverse engineer its proper use.
APFS brings another new feature known as “space sharing.” A single APFS “container” that spans a device can have multiple “volumes” (file systems) within it. Apple contrasts this with the static allocation of disk space to support multiple HFS+ instances, which seems both specious and an uncommon use case. Both ZFS and btrfs have a similar concept of a shared pool of storage with nested file systems for administration and management.
Speaking with Giampaolo and other members of the APFS team, we discussed how volumes are the unit by which users can control things like snapshots and encryption. You’d want multiple volumes to correspond with different policies around those settings. For example, while you might want to snapshot and backup your system each day, the massive /private/var/vm/sleepimage (for saving memory when hibernating) should live on its own and not be backed up.
Space sharing is more like an operational detail than a game changing feature. You can think of it like special folders with snapshot and encryption controls—which is probably why Apple’s marketing department has yet to make me a job offer. Adding new volumes can fail with an opaque error (does -69625 mean anything to you?), but using a larger disk image resolve the problem.
A modern trend in file systems has been to store data more efficiently to effectively increase the size of your device. Common approaches to this include compression (which, as noted above, is very very likely coming) and deduplication. Dedup finds common blocks and avoids storing them multiply. This is potentially highly beneficial for file servers where many users or many virtual machines might have copies of the same file; it’s probably not useful for the single-user or few-user environments that Apple cares about. (Yes, they have server-ish offerings but their heart clearly isn’t into it.) It’s also furiously hard to do well, as I painfully learned while supporting ZFS.
Apple’s number one goal is to avoid the beach ball of doom. APFS addresses this with I/O QoS to prioritize accesses that are immediately visible to the user over background activity that doesn’t have the same time-constraints.
Apple’s sort-of-unique contribution to space efficiency is constant-time cloning of files and directories. As a quick aside, “files” in macOS are often really directories; it’s a convenient lie they tell to allow logically related collections of files to be treated as indivisible units. Right-click an application and select “Show Package Contents” to see what I mean. Accordingly, I’m going to use the term “file” rather than “file or directory” in sympathy for the patient readers who have made it this far.
With APFS, if you copy a file within the same file system (or possibly the same container; more on this later), no data is actually duplicated. Instead, a constant amount of metadata is updated and the on-disk data is shared. Changes to either copy cause new space to be allocated (this is called “copy on write,” or COW). btrfs also supports this and calls the feature “reflinks”—link by reference.
I haven’t seen this offered in other file systems (btrfs excepted), and it clearly makes for a good demo, but it got me wondering about the use case. Copying files between devices (e.g. to a USB stick for sharing) still takes time proportional to the amount of data copied of course. Why would I want to copy a file locally? The common case I could think of is the layman’s version control: “thesis,” “thesis-backup,” “thesis-old,” “thesis-saving because I’m making edits while drunk.”
There are basically three categories of files:
- Files that are fully overwritten each time; images, MS Office docs, videos, etc.
- Files that are appended to, like log files
- Files with a record-based structure, such as database files
For the average user, most files fall into that first category. So with APFS I can make a copy of my document and get the benefits of space sharing, but those benefits will be eradicated as soon as I save the new revision. Perhaps users of larger files have a greater need for this and have a better idea of how it might be used.
Personally, my only use case is taking a file—say time-shifted Game of Thrones episodes falling into the “fair use” section of copyright law—and sticking it in Dropbox. Currently I need to choose to make a copy or permanently move the file to my Dropbox folder. Clones would let me do this more easily. But then, so would hard links (a nearly ubiquitous file system feature that lets a file appear in multiple directories).
Clones open the door for potential confusion. While copying a file may take up no space, so too deleting a file may free no space. Imagine trying to free space on your system, and needing to hunt down the last clone of a large file to actually get your space back.
APFS engineers don’t seem to have many use cases in mind; at WWDC they asked for suggestions from the assembled developers (the best I’ve heard is for copied virtual machines, which is not exactly a mass-market problem). If the focus is generic revision control, I’m surprised that Apple didn’t shoot for a more elegant solution. One could imagine functionality with APFS that allows a user to enable per-file Time Machine—change tracking for any file. This would create a new type of file where each version is recorded transparently and automatically. You could navigate to previous versions, prune the history, or delete the whole pile of versions at once (with no stray clones to hunt down). In fact, Apple introduced something related 5 years ago, but I’ve literally never seen or heard of it until researching this post (show of hands if you’ve clicked “Browse All Versions…”). APFS could clean up its implementation, simplify its use, and bring generic support for all applications. None of this solves my Game of Thronesstorage problem, but I’m not even sure it’s much of a problem…
Side note: Finder copy creates space-efficient clones, but cp from the command line does not.
APFS claims to be optimized for flash. Flash memory (NAND) is the stuff in your speedy SSD. Apple changed the computing industry when it put flash into the iPod and iPhone, volumes for which fundamentally changed the economics of flash. This consumer change impacted the enterprise (as it often does), giving rise to hybrid and all-flash arrays. Ten years ago flash cost as much as DRAM; now it’s challenging the economics of hard disks.
SSDs mimic the block interface of conventional hard drives, but the underlying technology is completely different. In particular, while magnetic media can read or write sectors arbitrarily, flash erases large chunks (blocks) and reads and writes smaller chunks (pages). The management is done by what’s called the flash translation layer (FTL), software that makes blocks and pages appear more like a hard drive. (Editor’s note: we’ve got a huge, 10,000 word breakdown on how SSDs work if you’d like to know more). An FTL is very similar to a file system, creating a virtual mapping (a translation) between block addresses and locations within the media. Apple controls the full stack, including the SSD, FTL, and file system; they could have built something differentiated, optimizing this components to work together. What APFS does, however, is simply write in patterns known to be more easily handled by NAND. It’s a file system with flash-aware characteristics rather than one written explicitly for the native flash interfaces—more or less what you’d expect in 2016.