Deduptar — the tar that deduplicates

TAR, a prevalent archive format on Unix-like systems, has archived files the same way for decades: read file entries from a filesystem, write each entry's metadata (file name, permissions and such) to the archive file as a header, and follow that header with a copy of the file contents. Unpacking the archive involves copying the file data out again. In a nutshell, a bunch of copying.

So if you're about to tar up a gigabyte of files into such an archive, you'll need at least one gigabyte of free space to store the resultant tarball. Conversely, when you unpack that tarball, you'll need a gigabyte of disk space to store the files you unpacked.

But not inevitably so! If you are on Linux, and are using a fancy filesystem type — BTRFS and probably XFS, currently — then Deduptar will pack files into tar archives, and extract files from them, not by copying, but by using the FICLONERANGE ioctl (available from Linux kernel 4.5 and up). Basically, for adding a file to the tarball, deduptar will tell the filesystem "those bytes of that file there? Pretend that they're part of this file here, too". And for extracting a file it does it the other way around. Thus the file data is not copied, but rather, another reference to already-existing already-referenced data is created — resulting in far less I/O and disk space usage.

That sounds simple enough [1]. The trick, though, is in padding out entries in the tarfile so that the file contents parts of the tarfile start exactly on filesystem page boundaries; otherwise they can't be cloned in or out. And the tar format doesn't support that kind of padding. Or does it? Answer: it does! One can craft PAX header comments of the exact right size to push an entry's data section to the next page start.
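The padding arithmetic can be sketched in a few lines of Python. This is a guess at the general technique rather than deduptar's real implementation: it assumes a 4096-byte alignment unit, a tarball that itself starts at file offset 0, and the usual layout of one pax extended header block, its data blocks, and then the regular header block. The function names are made up for this example.

```python
BLOCK = 512    # tar always works in 512-byte blocks
ALIGN = 4096   # assumed alignment unit of the filesystem


def padding_blocks(pos: int, align: int = ALIGN) -> int:
    """Number of 512-byte pax data blocks needed before the entry at `pos`.

    `pos` is the archive offset where the next entry starts (a multiple of
    512). Returns 0 if a plain header already leaves the data aligned.
    """
    if (pos + BLOCK) % align == 0:
        return 0
    # With padding the entry looks like:
    #   pax header block + n pax data blocks + regular header block + data
    n = (-(pos + 2 * BLOCK)) % align // BLOCK
    # Zero data blocks would mean an empty pax header; use a full unit instead.
    return n if n > 0 else align // BLOCK


def pax_comment_record(size: int) -> bytes:
    """Build one pax 'comment' record that is exactly `size` bytes long.

    A pax record is "<length> <keyword>=<value>\\n", where <length> counts
    the whole record, including its own digits.
    """
    prefix = f"{size} comment=".encode()
    filler = size - len(prefix) - 1  # minus the trailing newline
    if filler < 0:
        raise ValueError("size too small for a comment record")
    return prefix + b" " * filler + b"\n"
```

For the very first entry (`pos == 0`), `padding_blocks` returns 6: one pax header block, six comment blocks and one regular header block add up to 4096 bytes, so the file data starts exactly at offset 4096.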

The README has more details as well as installation instructions. It's not an exact drop-in for GNU tar (for instance, deduptar doesn't support tape changers like GNU tar does), but it has pretty much everything you'd need for everyday tarring. In fact the tests assert bidirectional compatibility with GNU tar.

Footnotes


  1. It's surprisingly hard to make an actually decent tar. We have to deal with hardlinks that need to be reproduced, and with symlinks, which can cause cycles in the filesystem "tree" (making it not really a tree when traversing it during archiving or extraction). Making unarchival resistant to symlink attacks is especially tricky, but at least we have the openat2() system call now.

Cinematic explanation

Here's the cinematic version of the README, shot at All Systems Go 2023: