Finding duplicate files using Python
I wrote this script to find and optionally delete duplicate files in a directory tree. The script uses MD5 hashes of each file’s content to detect duplicate files. This script is based on zalew’s answer on stackoverflow. So far I have found this script sufficient for accurately finding and removing duplicate files in my photograph collection.
"""Find duplicate files inside a directory tree."""
from os import walk, remove, stat
from os.path import join as joinpath
from md5 import md5
def find_duplicates( rootdir ):
"""Find duplicate files in directory tree."""
filesizes = {}
# Build up dict with key as filesize and value is list of filenames.
for path, dirs, files in walk( rootdir ):
for filename in files:
filepath = joinpath( path, filename )
filesize = stat( filepath ).st_size
filesizes.setdefault( filesize, [] ).append( filepath )
unique = set()
duplicates = []
# We are only interested in lists with more than one entry.
for files in [ flist for flist in filesizes.values() if len(flist)>1 ]:
for filepath in files:
with open( filepath ) as openfile:
filehash = md5( openfile.read() ).hexdigest()
if filehash not in unique:
unique.add( filehash )
else:
duplicates.append( filepath )
return duplicates
if __name__ == '__main__':
from argparse import ArgumentParser
PARSER = ArgumentParser( description='Finds duplicate files.' )
PARSER.add_argument( 'root', metavar='R', help='Dir to search.' )
PARSER.add_argument( '-remove', action='store_true',
help='Delete duplicate files.' )
ARGS = PARSER.parse_args()
DUPS = find_duplicates( ARGS.root )
print '%d Duplicate files found.' % len(DUPS)
for f in sorted(DUPS):
if ARGS.remove == True:
remove( f )
print '\tDeleted '+ f
else:
print '\t'+ f
I discovered the argparse module (added in Python 2.7) in the standard library this week and it makes command line parameter handling nice and concise.
UPDATE: Changed uniques array into a set and added first pass using file sizes as performance improvement, allot faster now.
UPDATE: You can now find this script on github at github.com/dpbrown/Duplicate-Files.
The return value of hexdigest() is a string. So in the code ‘unique’ can be a set instead of list. Making ‘unique’ a set makes the intent more clear and also a ‘not in’ on a set would be faster
Why not:
…
def find_duplicates( rootdir ):
“”"Find duplicate files in directory tree.”"”
unique = set()
for path, dirs, files in walk( rootdir ):
for filename in files:
filepath = joinpath(path, filename )
filehash = md5(open(filepath , ‘rb’).read()).digest()
if filehash not in unique:
unique.add(filehash)
else:
yield filepath
…
Hm, but this doesn’t show which files are the same? I would rather use a dictionary that maps from md5 sum to filepaths:
from collections import defaultdict
def find_duplicates( rootdir ):
dup = defaultdict(list)
for path, dirs, files in walk( rootdir ):
for filename in files:
filepath = joinpath( path, filename )
try:
filehash = md5( file( filepath ).read() ).hexdigest()
except IOError:
continue
dup[filehash].append(filepath)
return [paths for paths in dup.values() if len(paths) > 1]
The whole script can be also simulated in this simple shell action:
$ find . -type f -exec md5sum {} \; | uniq -d -w 32
If your files tend to be large (like video files), or file access is low (like files on a network share), you may want to do the duplicate detection in stages.
It can be relatively expensive to read in a whole file and calculate an MD5 hash. It’s cheap to find out a file’s size, though, and files with different sizes cannot be duplicates of each other. So if you start by creating a list of files along with their sizes, you can go back and only calculate the MD5 hash for the files that have identical lengths.
Nice article. You could use set instead of list to speedup lookups.
If you’ve got a lot of files, the ‘filehash not in uniques’ check will get very expensive. This is because Python has to scan through the list one by one looking for filehash. So if you have N files, it will take O(N^2) time.
If you initialise unique with set() instead of [], the ‘not in’ check will be quick even with a lot of distinct files.
I wrote a tool like this, too. It compares files byte-by-byte, which can be quite a lot of files than hashing the whole contents. See http://liw.fi/dupfiles/ for my program. That page lists a few other tools too, the fastest of which seems to be the one called hardlink (http://jak-linux.org/projects/hardlink/).
Regarding the md5 hashes: it’s really not necessary to hash the whole file. When I did a similar task I indexed on file size as Jerry did, then on an md5 has of the first 1kbytes, and only if there were matches on that did I do a full file hash.
Conversely, if one compares the md5(1k) hashes (and deliberately ignores file sizes) the program can find files where one is a truncated version of another.
Lastly, your program now stores its results only in a volatile storage, which means if the task is big (if you’re looking for duplicates among millions of files, for example) interrupting it or rerunning it means all the work must be done again. If you store the data instead in a persistent store like PyBSDDB then you can resume and rerun with little additional work.
Repost with working link:
I use fdupes for this, it’s packaged in at least Ubuntu (sudo apt-get install fdupes).
For photos there’s also findimagedupes (sudo apt-get install findimagedupes) which even finds visually similar images.
Why nobody is wondering, that the “open” file handles won’t be closed?
Open file objects should be closed: always!
The use of the “with” statement (Python 2.6+) may simplify this really good.
with file(filepath) as open_file:
filehash = md5(open_file.read()).hexdigest()
Daniel, the two places I usually look at are Debian (30 kilopackages and growing), since that makes it easy for me to install if it’s there, and http://freshmeat.net/ (who would like to list everything free).










You should use a set() instead of an array to store uniques checksums, you’ll have a ~constant time while seeking for a collision. Using an array, python does a linear search in the array, taking more and more time. (A little test shows that to store 10000 md5 in an array checking for collisions takes ~2s, in a set: ~0.1s).
Then shouldn’t you use a fastest ‘hash’ in a first pass to take apart very different files without opening them, like, file size ?
Finaly why reinvent the whell, just use fdupes
Also before deleting you should binary compare the files, in case of md5 collision.