Dupmerge and Rmdups
Phil Karn, KA9Q

<title>Dupmerge and Rmdups</title>
<center><h1>Dupmerge and Rmdups<br>Phil Karn, KA9Q</h1></center>

These two command-line utilities are designed to help you clean out a file system
that has become cluttered with duplicate files. If you're like
me, you often make ad-hoc backups of a directory hierarchy
with <b>tar</b>, <b>rsync</b> or <b>cpio</b> that tend to sit around forever.
Or you keep much the same data on more than one system (with one typically
being a laptop), and occasionally you want to consolidate the two collections.

<p>These utilities can help you do that. There are two: <b>dupmerge</b>, which reclaims the space used by redundant copies,
and <b>rmdups</b> which eliminates redundant path names pointing to a common file.

<p>
<hr>
<a href="dupmerge.1">dupmerge man page</a>
<br>
<a href="dupmerge.c">dupmerge.c source code</a>
<br>
<a href="dupmerge-osx">dupmerge binary for OSX Snow Leopard (64 bit)</a>
<br>
<a href="dupmerge-linux-i386">dupmerge binary for i386 Linux</a>
<hr>
<a href="rmdups.1">rmdups man page</a>
<br>
<a href="rmdups.c">rmdups.c source code</a>
<br>
<a href="rmdups-osx">rmdups binary for OSX Snow Leopard (64 bit)</a>
<br>
<a href="rmdups-linux-i386">rmdups binary for i386 Linux</a>
<hr>

<h3>Unix Hard Links</h3>

To use either command properly you <em>must</em> be familiar with the concept of the
<a href="http://en.wikipedia.org/wiki/Hard_link">hard link</a>, so I'll provide a quick refresher.
It originated in
<a href="http://en.wikipedia.org/wiki/UNIX">UNIX</a>, so it has long been part of UNIX clones like
<a href="http://en.wikipedia.org/wiki/Linux">Linux</a>,
<a href="http://en.wikipedia.org/wiki/BSD">BSD</a>, and
<a href="http://en.wikipedia.org/wiki/Mac OSX">Mac OSX</a>.
It may have also been picked up by other operating systems with which I am unfamiliar.

<p>It might not seem like it, but UNIX files do not actually have names; internally they are uniquely identified by
device and
<a href="http://en.wikipedia.org/wiki/inode">inode</a> numbers.
The user refers to them with
<a href="http://en.wikipedia.org/wiki/Path (computing)">path names</a>,
printable strings that define a series of directories delimited by slashes (/) -- a path -- that ultimately point to the inode
and the file's metadata and contents.
Every file (inode) has at least one path name pointing to it.
When more than one path name points to the same inode, they are all
equally valid; there is no preferred or primary path name.
These path names are sometimes referred to as "hard links", as
distinguished from the
<a href="http://en.wikipedia.org/wiki/Symbolic link">"soft" or "symbolic" link</a>, a special kind of file that contains
a path name (possibly one of several hard links!) pointing to another file.

<p>
The <b>dupmerge</b> command works by reading a list of path names, identifying those that point to distinct inodes with the same content, deleting all but one,
and recreating the path names for the deleted copies as hard links to the remaining copy.
When it finishes, all the path names given to the program still exist, and each provides the same content when read.
However, the redundant copies have been deleted and the space reclaimed by the filesystem free list.
<p>
Here is a simple shell command that invokes dupmerge on all the files in the current directory and under any and all subdirectories:
<p>
<b>find . -print0 | dupmerge -0</b>
<p>
<b>dupmerge</b> automatically ignores all directories, special device files, FIFOs -- everything that isn't an ordinary file -- so you don't have to
bother with any fancy arguments to the <b>find</b> command.
When it's done, every path name will still be present and yield the same contents when accessed.
But any path names that had referred to separate copies of the same data are now hard links to a single
copy.
<p>
Running dupmerge can still make some visible differences and it is important to understand them before
you use the program on an important file system because
there is, as yet, no way to reverse the effects of <b>dupmerge</b>.
In UNIX, a file's metadata belongs to the inode, not the path name. The metadata includes
the ownership and group membership of the file, its access permissions, and the times of its creation and (if enabled) last access and inode modification.
Since only one inode in each group of inodes with the same contents will survive an execution of <b>dupmerge</b>,
file ownerships, permissions and modification timestamps may change for certain pathnames even though the <em>contents</em> remain unchanged.
The metadata on the deleted copies of the file is currently lost, though a future version of dupmerge may save this data in an "undo" file so that its
effects could be undone.
Until then, it is best to limit dupmerge to directory hierarchies whose files have a single owner and group, and to anticipate any problems
that might be caused by the loss of metadata on the redundant copies that will be deleted.
<b>Dupmerge</b> always keeps the oldest copy among a group of identical files on the principle that the oldest timestamp is most likely the actual
time that the data was created, with the newer timestamps merely reflecting the time that each copy was made from the original.

<p>
Because empty files are often used as system lock files, with significance in the inode metadata, <b>dupmerge</b> ignores them by default.
(There's little to reclaim by deleting an empty file anyway.)

<h3>Dupmerge Options</h3>

<h4>-0</h4>By default, each file name is terminated with either a newline or a null. This option specifies that each file name can only be
terminated with a null. (Use find's -print0 option to generate such a list.)

<h4>-z</h4>By default, empty files are ignored. They are often used as locks, and disturbing them could cause
problems. Besides, empty files have no assigned blocks so there's not much to be gained by deleting them (other than the inode entry, of course).
This option overrides this behavior; when set, all empty files will be linked together.

<h4>-s</h4>By default, the list of files is sorted by decreasing size
and the largest files are examined first.

<p>I often run dupmerge when a file system is out of (or is about to run out of) free space,
so scanning the largest files first usually recovers space more quickly.
I can't think of a reason to examine the smallest files first, but if you can this option will do it.

<h4>-n</h4>This flag tells dupmerge to conduct a dry run.
It will appear to operate normally, but without actually
unlinking or relinking any files.

<h4>-q</h4>Suppress the progress messages that dupmerge normally generates, including the unlink/link commands generated for each
duplicate file. Error messages are still displayed. This flag is forced off by the -n flag.

<h4>-f</h4>Enable a shortcut comparison heuristic that can substantially speed up the program
under certain special cases. When set, a pair of files with the same basename (the part of the path name after the last '/'), the same size,
and the same modification date will be considered identical without actually comparing their contents.
This can be an important special case as many duplicates are created by copying a directory hierarchy with a utility like
rsync, tar or cpio that preserves modification times.
The widely used rsync program uses this heuristic by default so it doesn't seem terribly unsafe, but keep in mind the (small) risk that two
different files might be erroneously considered identical.
For this reason I recommend avoiding this flag when possible; the default file comparison strategy is actually quite fast.

<h4>-t</h4>The probability that two different files will have the same size decreases with increasing file size.
To reduce the risk of false matches with the -f option, files less than a certain size will nonetheless be compared by their hash codes
even when -f is selected.
The default size threshold is 100,000 bytes; this option allows another threshold to be set.


<h3>Notes on dupmerge</h3>
My first version of this program circa 1993 worked by computing MD5
hashes of every file, sorting the hashes and then looking for duplicates.
This worked but it was unnecessarily slow.
One reason was that it computed a hash for every file, including those with unique sizes
that couldn't possibly have any duplicates
(duplicate files always have the same size!)

<p>My <a href="dupmerge2.c">second version</a> circa 1999 unlinked the duplicates
as a side effect of the sort comparison function.

<p>I have since rewritten it again from scratch. It begins by sorting the file list by size.
Files with unique sizes are ignored;
two or more files having the same size are compared by their SHA-1 hashes.
Comparing hashes, as opposed to actual file contents, has significant performance benefits in most real-world workloads.
Each file can be read sequentially, without any disk seeks, and its hash cached so that it need only be read once no matter how many other files
have the same size.

<p>But there are nonetheless pathological situations where this algorithm performs much more slowly than direct file-to-file comparison.
Consider a collection of a thousand unique files, each exactly 10 megabytes in size.
The hash comparison strategy involves reading every file in its entirety while a direct
file-to-file comparison can abort as soon as the first difference is detected. Most files, if different, differ near their beginnings.

<p>To handle this possibility with reasonable efficiency, comparing a pair of files actually entails a series of steps.
First, both files must have the same size and be on the same file system.
(Hard links cannot extend across file systems.)
Second, I compute and compare the SHA-1 hash of the first page (4 KB) of each file.
This will catch most differing files.
If, and only if, the first 4KB of each file have the same hash do I proceed to compute and compare the SHA-1 hash for each entire file.
This performs well in practice because most files that differ at all will do so in the first 4KB.
It is unusual for two files to have the same size and the
same first page, yet differ beyond that.
However, it is certainly possible, which is why it is still necessary to compare the complete hashes before declaring two files to be identical.

<p>
Every hash result (on the leading page or over the full file) is cached so it never has to be computed more than once.
This improves performance substantially when there are many files of the same size.
<p>
The program displays copious statistics at the end of execution, such as the number of disk blocks reclaimed from duplicates,
the number of first-page and complete-file hash function computations,
the number of hash "hits" (references to hash values that have already been computed),
and the number of same-size file pairs whose first page hashes match but differ in their full-file hashes.

<hr>
<h3>Rmdups</h3>
(Documentation incomplete.) This is a companion to <b>dupmerge</b> that can be used to get rid of unwanted hard links.
<p>
Like <b>dupmerge</b>, <b>rmdups</b> also accepts a list of path names from standard input, e.g.,
<p>
<b>find . -print0 | rmdups -0</b>
<p>
It takes only one argument, -0, that specifies that each input path name is null terminated.
I strongly recommend this, along with the <b>-print0</b> option to <b>find</b>.
<p>
Hard links to common files are identified and grouped.
If the input includes every hard link to some particular file, they are displayed and the user is prompted as to which he would like to keep;
all the others are unlinked.

But if the input contains only <em>some</em> of the path names for a given file, <em>those that are cited are automatically unlinked without user intervention</em>.

<p>This feature seems more dangerous than it really is, but it's what makes <b>rmdups</b> useful when cleaning up directories of redundant entries.
First of all, <b>rmdups</b> <em>never</em>, under <em>any</em> circumstances, unlinks the last remaining path name to a file (which is what actually deletes a file in a UNIX-like file system.)

<p>
Consider two directories, 'work' and 'tmp'. Some time ago you created 'tmp' as a snapshot of 'work', and now you'd like to clean things up
and merge everything back into the 'work' directory. You have already executed
<p>
<b>find work tmp -print0 | dupmerge -0</b>
<p>
so they contain links to a more or less common set of files.

Now you'd like to get rid of the 'tmp' directory, but the latter also has a few unique files that you'd like
to keep so you don't want to issue a simple <b>rm -rf tmp</b>.
The command <b>find tmp -print0 | rmdups -0</b> automatically clears every path name under 'tmp' that points to a file also referred to from 'work',
leaving any singly-linked files under 'tmp' untouched so you
can go through them by hand.

<p>
Consider this sequence:
<p>
<b>find . -print0 | dupmerge
<br>find . -print0 | rmdups
</b>
<p>
This sequence first identifies duplicate files and reclaims the space in the redundant copies.
Then it lists all the path names that point to each file, prompting the user to select one to keep.

<p>
<b>rmdups</b> is much less mature than <b>dupmerge</b>, and it needs more work. Right now, for example,
there is no way to specify that all of the path names for a file should be kept without simply quitting the program.

<p>
Both <b>dupmerge</b> and <b>rmdups</b>, by their nature, are programs that can potentially do a lot of damage to a file system should there prove to be
any bugs.
With this in mind, I've included a lot of rather paranoid assert checks that will cause the program to abort should something not be as expected.
I have used <b>dupmerge</b> for quite some time without any problems, but I wouldn't bet my life on their being none. Should you run into any bugs,
please let me know ASAP so I can fix them.
<p>

Phil Karn, KA9Q, karn@ka9q.net

<p><em>Last updated: 7 May 2010</em>