These utilities can help you do that. There are two: dupmerge, which reclaims the space used by redundant copies, and rmdups, which eliminates redundant path names pointing to a common file.
It might not seem like it, but UNIX files do not actually have names; internally they are uniquely identified by device and inode numbers. The user refers to them with path names, printable strings that define a series of directories delimited by slashes (/) -- a path -- that ultimately point to the inode and the file's metadata and contents. Every file (inode) has at least one path name pointing to it. When more than one path name points to the same inode, they are all equally valid; there is no preferred or primary path name. These path names are sometimes referred to as "hard links", as distinguished from the "soft" or "symbolic" link, a special kind of file that contains a path name (possibly one of several hard links!) pointing to another file.
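In other words, a second hard link is simply another, equally valid name for the same inode. A short Python sketch (with hypothetical file names) demonstrates this:

```python
import os
import tempfile

# Illustrative sketch: two hard links to one inode are equally valid names
# for the same file.  (The directory and file names here are hypothetical.)
d = tempfile.mkdtemp()
a = os.path.join(d, "a")
b = os.path.join(d, "b")

with open(a, "w") as f:
    f.write("hello\n")

os.link(a, b)                    # create a second hard link to the same inode

sa, sb = os.stat(a), os.stat(b)
print(sa.st_ino == sb.st_ino)    # True: one inode, two path names
print(sa.st_nlink)               # 2: the inode's link count

os.unlink(a)                     # removing one name does not delete the file
with open(b) as f:
    print(f.read(), end="")      # the contents survive under the other name
```

The file itself is only deleted when its last remaining hard link is removed and its link count drops to zero.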
The dupmerge command works by reading a list of path names, identifying those that point to distinct inodes with the same content, deleting all but one, and recreating the path names for the deleted copies as hard links to the remaining copy. When it finishes, all the path names given to the program still exist, and each provides the same content when read. However, the redundant copies have been deleted and the space reclaimed by the filesystem free list.
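The replacement step for a single duplicate might be sketched like this (an illustrative Python sketch under names of my own choosing, not the program's actual code):

```python
import os

def merge_duplicate(keeper: str, dup: str) -> None:
    """Replace 'dup' with a hard link to 'keeper' (illustrative sketch only).

    Linking under a temporary name and then renaming over the duplicate
    ensures a valid path name for the contents exists at every instant;
    the duplicate's inode is freed once its link count reaches zero.
    """
    tmp = dup + ".dupmerge-tmp"   # hypothetical temporary name
    os.link(keeper, tmp)          # new hard link to the surviving inode
    os.rename(tmp, dup)           # atomically replaces the duplicate
```

After this runs, the keeper and the former duplicate are two hard links to one inode, and the duplicate's blocks return to the file system free list.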
Here is a simple shell command that invokes dupmerge on all the files in the current directory and under any and all subdirectories:
find . -print0 | dupmerge -0
dupmerge automatically ignores all directories, special device files, FIFOs -- everything that isn't an ordinary file -- so you don't have to bother with any fancy arguments to the find command. When it's done, every path name will still be present and yield the same contents when accessed. But any path names that had referred to separate copies of the same data are now hard links to a single copy.
Running dupmerge can still make some visible differences, and it is important to understand them before using the program on an important file system, because there is as yet no way to reverse its effects. In UNIX, a file's metadata belongs to the inode, not the path name. The metadata includes the file's ownership and group membership, its access permissions, and the times of its last modification, last inode change, and (if enabled) last access. Since only one inode in each group of inodes with the same contents survives an execution of dupmerge, file ownerships, permissions and timestamps may change for certain path names even though the contents remain unchanged.
The metadata on the deleted copies is currently lost, though a future version of dupmerge may save it in an "undo" file so that the program's effects could be reversed. Until then, it is best to limit dupmerge to directory hierarchies whose files have a single owner and group, and to anticipate any problems that the loss of metadata on the deleted copies might cause.
Dupmerge always keeps the oldest copy in a group of identical files, on the principle that the oldest timestamp is most likely the time the data was actually created, with the newer timestamps merely reflecting when each copy was made from the original.
Because empty files are often used as system lock files, with significance in the inode metadata, dupmerge ignores them by default. (There's little to reclaim by deleting an empty file anyway.)
I often run dupmerge when a file system is out of (or about to run out of) free space, so dupmerge examines the largest files first; that usually recovers space more quickly. I can't think of a reason to examine the smallest files first, but an option is provided in case you can think of one.
My second version circa 1999 unlinked the duplicates as a side effect of the sort comparison function.
I have since rewritten it again from scratch. It begins by sorting the file list by size. Files with unique sizes are ignored; two or more files having the same size are compared by their SHA-1 hashes. Comparing hashes, as opposed to actual file contents, has significant performance benefits in most real-world workloads. Each file can be read sequentially, without any disk seeks, and its hash cached so that it need only be read once no matter how many other files have the same size.
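In outline, that pass might look like the following Python sketch (the function names and 64 KB read size are mine, not the program's):

```python
import hashlib
import os
from collections import defaultdict

_hash_cache = {}

def sha1_of(path):
    # Cache each file's hash so the file is read at most once, no matter
    # how many other files share its size.
    if path not in _hash_cache:
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        _hash_cache[path] = h.digest()
    return _hash_cache[path]

def duplicate_groups(paths):
    # Bucket candidates by size; a file with a unique size has no duplicates.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.stat(p).st_size].append(p)
    for group in by_size.values():
        if len(group) < 2:
            continue
        by_hash = defaultdict(list)
        for p in group:
            by_hash[sha1_of(p)].append(p)
        for dups in by_hash.values():
            if len(dups) > 1:
                yield dups        # same size and same SHA-1: duplicates
```

Each group this yields could then be merged down to a single inode as described above.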
But there are nonetheless pathological situations where this algorithm performs much more slowly than direct file-to-file comparison. Consider a collection of a thousand unique files, each exactly 10 megabytes in size. The hash comparison strategy involves reading every file in its entirety while a direct file-to-file comparison can abort as soon as the first difference is detected. Most files, if different, differ near their beginnings.
To handle this possibility with reasonable efficiency, comparing a pair of files actually entails a series of steps. First, both files must have the same size and be on the same file system. (Hard links cannot extend across file systems.) Second, I compute and compare the SHA-1 hash of the first page (4 KB) of each file. This will catch most differing files. If, and only if, the first 4 KB of each file have the same hash do I proceed to compute and compare the SHA-1 hash of each entire file. This performs well in practice because most files that differ at all will do so in the first 4 KB. It is unusual for two files to have the same size and the same first page, yet differ beyond that. However, it is certainly possible, which is why it is still necessary to compare the complete hashes before declaring two files identical.
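The staged test might be sketched as follows (illustrative Python; the helper names and read sizes are my own, not the program's):

```python
import hashlib
import os

PAGE = 4096     # size of the leading page that is hashed first

def page_hash(path):
    # Hash only the first page; a cheap filter for differing files
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(PAGE)).digest()

def full_hash(path):
    # Hash the entire file, reading it sequentially in chunks
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.digest()

def same_contents(a, b):
    sa, sb = os.stat(a), os.stat(b)
    if sa.st_size != sb.st_size:
        return False                  # different sizes: trivially different
    if sa.st_dev != sb.st_dev:
        return False                  # hard links cannot cross file systems
    if page_hash(a) != page_hash(b):
        return False                  # most differing files are caught here
    return full_hash(a) == full_hash(b)   # confirm over the entire file
```

Only files that survive all the cheap tests pay the cost of a full-file hash.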
Every hash result (on the leading page or over the full file) is cached so it never has to be computed more than once. This improves performance substantially when there are many files of the same size.
The program displays copious statistics at the end of execution, such as the number of disk blocks reclaimed from duplicates, the number of first-page and complete-file hash function computations, the number of hash "hits" (references to hash values that have already been computed), and the number of same-size file pairs whose first page hashes match but differ in their full-file hashes.
Like dupmerge, rmdups accepts a list of path names from standard input, e.g.,
find . -print0 | rmdups -0
It takes only one option, -0, which specifies that each input path name is null-terminated. I strongly recommend using it, along with the corresponding -print0 option to find.
Hard links to common files are identified and grouped. If the input includes every hard link to a particular file, they are all displayed and the user is prompted to choose which one to keep; the others are unlinked. But if the input contains only some of the path names for a given file, those cited are unlinked automatically, without user intervention.
This feature seems more dangerous than it really is, and it is what makes rmdups useful for cleaning up directories of redundant entries. First of all, rmdups never, under any circumstances, unlinks the last remaining path name to a file (which is what actually deletes a file on a UNIX-like file system).
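That behavior can be sketched as follows (illustrative Python under assumed names; choose_keeper stands in for the interactive prompt):

```python
import os
from collections import defaultdict

def rmdups_sketch(paths, choose_keeper):
    # Group the given path names by the (device, inode) they reference.
    groups = defaultdict(list)
    for p in paths:
        st = os.stat(p)
        groups[(st.st_dev, st.st_ino)].append(p)
    for names in groups.values():
        nlink = os.stat(names[0]).st_nlink
        if nlink == 1:
            continue                      # sole path name: never unlinked
        if len(names) == nlink:
            keep = choose_keeper(names)   # every link cited: ask which to keep
        else:
            keep = None                   # uncited links exist elsewhere:
                                          # unlink every cited name
        for p in names:
            if p != keep:
                os.unlink(p)
```

In every case at least one hard link survives, either the one the user chose to keep or a link that was never cited in the input.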
Consider two directories, 'work' and 'tmp'. Some time ago you created 'tmp' as a snapshot of 'work', and now you'd like to clean things up and merge everything back into the 'work' directory. You have already executed
find work tmp -print0 | dupmerge -0
so they contain links to a more or less common set of files. Now you'd like to get rid of the 'tmp' directory, but it also holds a few unique files that you'd like to keep, so you don't want to issue a simple rm -rf tmp. The command find tmp -print0 | rmdups -0 automatically removes every path name under 'tmp' that points to a file also referred to from 'work', leaving any singly-linked files under 'tmp' untouched so you can go through them by hand.
Consider this sequence:
find . -print0 | dupmerge -0
find . -print0 | rmdups -0
This sequence first identifies duplicate files and reclaims the space in the redundant copies. Then it lists all the path names that point to each file, prompting the user to select one to keep.
rmdups is much less mature than dupmerge, and it needs more work. Right now, for example, there is no way to specify that all of the path names for a file should be kept without simply quitting the program.
Both dupmerge and rmdups, by their nature, are programs that could do a lot of damage to a file system should there prove to be any bugs. With this in mind, I've included a lot of rather paranoid assert checks that cause the programs to abort should something not be as expected. I have used dupmerge for quite some time without any problems, but I wouldn't bet my life on there being none. Should you run into any bugs, please let me know ASAP so I can fix them.
Phil Karn, KA9Q, email@example.com
Last updated: 7 May 2010