dupmerge.c

This is a utility that scans a UNIX directory tree looking for pairs of distinct files with identical content. When it finds such a pair, it deletes one of the two files to reclaim its disk space and then recreates that path name as a link to the other copy.
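
As a rough illustration of the delete-and-relink step, here is a minimal sketch. It is not taken from dupmerge.c: the helper name replace_with_link is hypothetical, and it assumes the link is a hard link made with link(2), which requires both paths to be on the same file system.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not from dupmerge.c): make `dup` a hard link to `keep`.
 * Assumes the caller has already verified that both files have identical
 * content. Returns 0 on success, -1 on failure with errno set. */
int replace_with_link(const char *keep, const char *dup)
{
    struct stat sk, sd;

    if (stat(keep, &sk) != 0 || stat(dup, &sd) != 0)
        return -1;

    /* Already the same inode: the two names share storage, nothing to reclaim. */
    if (sk.st_dev == sd.st_dev && sk.st_ino == sd.st_ino)
        return 0;

    /* Hard links cannot cross file systems, so refuse before deleting anything. */
    if (sk.st_dev != sd.st_dev) {
        errno = EXDEV;
        return -1;
    }

    /* Delete the duplicate, then recreate its name as a link to the kept copy. */
    if (unlink(dup) != 0)
        return -1;
    if (link(keep, dup) != 0)
        return -1;

    return 0;
}
```

Note that there is a short window between unlink() and link() in which the duplicate's path name does not exist. A more cautious variant could link to a temporary name first and then rename() it over the duplicate.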

My first version of this program, written circa 1993, worked by computing an MD5 hash of every file, sorting the hashes, and then looking for duplicates. This worked, but it was unnecessarily slow. The comparison function I use now stops comparing two files as soon as it determines that their lengths are different, which is a win when you have many large files with unique lengths.
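
The length-first idea can be sketched as follows. This is an illustrative example under stated assumptions, not the comparison function from dupmerge.c: the name files_identical is hypothetical, lengths are assumed to come from stat(2), and the byte-by-byte fallback for equal-length files is my own choice for the sketch.

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper (not from dupmerge.c).
 * Returns 1 if the two files have identical content, 0 if they differ,
 * and -1 if either file cannot be examined. */
int files_identical(const char *path_a, const char *path_b)
{
    struct stat sa, sb;

    if (stat(path_a, &sa) != 0 || stat(path_b, &sb) != 0)
        return -1;

    /* Cheap test first: files of different lengths cannot be identical,
     * so we can stop here without reading any file data. */
    if (sa.st_size != sb.st_size)
        return 0;

    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    int result = -1;

    if (fa != NULL && fb != NULL) {
        char buf_a[8192], buf_b[8192];

        result = 1;
        for (;;) {
            size_t na = fread(buf_a, 1, sizeof buf_a, fa);
            size_t nb = fread(buf_b, 1, sizeof buf_b, fb);

            if (ferror(fa) || ferror(fb)) {
                result = -1;
                break;
            }
            /* Stop at the first chunk that differs. */
            if (na != nb || memcmp(buf_a, buf_b, na) != 0) {
                result = 0;
                break;
            }
            if (na == 0)    /* both files reached end-of-file together */
                break;
        }
    }

    if (fa != NULL)
        fclose(fa);
    if (fb != NULL)
        fclose(fb);
    return result;
}
```

With this ordering, a large file whose length is unique in the tree is never opened at all, whereas a hash-everything approach has to read every byte of it.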

Last updated: 16 Jan 1999