Finding duplicate files using Python
I wrote this script to find and optionally delete duplicate files in a directory tree. It first groups files by size, then compares MD5 hashes of the contents of any files that share a size. The script is based on zalew’s answer on Stack Overflow. So far I have found it sufficient for accurately finding and removing duplicate files in my photograph collection.
"""Find duplicate files inside a directory tree."""
from os import walk, remove, stat
from os.path import join as joinpath
from hashlib import md5

def find_duplicates( rootdir ):
    """Find duplicate files in directory tree."""
    filesizes = {}
    # Build up dict with key as filesize and value as list of filenames.
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            filepath = joinpath( path, filename )
            filesize = stat( filepath ).st_size
            filesizes.setdefault( filesize, [] ).append( filepath )
    unique = set()
    duplicates = []
    # We are only interested in lists with more than one entry.
    for files in [ flist for flist in filesizes.values() if len(flist)>1 ]:
        for filepath in files:
            # Open in binary mode so the hash covers the raw file bytes.
            with open( filepath, 'rb' ) as openfile:
                filehash = md5( openfile.read() ).hexdigest()
            if filehash not in unique:
                unique.add( filehash )
            else:
                duplicates.append( filepath )
    return duplicates

if __name__ == '__main__':
    from argparse import ArgumentParser
    PARSER = ArgumentParser( description='Finds duplicate files.' )
    PARSER.add_argument( 'root', metavar='R', help='Dir to search.' )
    PARSER.add_argument( '-remove', action='store_true',
                         help='Delete duplicate files.' )
    ARGS = PARSER.parse_args()
    DUPS = find_duplicates( ARGS.root )
    print( '%d Duplicate files found.' % len(DUPS) )
    for f in sorted(DUPS):
        if ARGS.remove:
            remove( f )
            print( '\tDeleted '+ f )
        else:
            print( '\t'+ f )
I discovered the argparse module (added in Python 2.7) in the standard library this week and it makes command line parameter handling nice and concise.
UPDATE: Changed the uniques array into a set and added a first pass using file sizes as a performance improvement; it is a lot faster now.
UPDATE: You can now find this script on GitHub at github.com/dpbrown/Duplicate-Files.
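The script above reads each candidate file fully into memory before hashing it. For collections with very large files, one possible variation (not part of the original script) is to feed the hash in fixed-size chunks. This is a minimal sketch written for Python 3; file_md5 and chunk_size are names made up for this illustration.

```python
import hashlib

def file_md5(filepath, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(filepath, 'rb') as openfile:
        # iter() with a sentinel keeps calling read() until it returns b''.
        for chunk in iter(lambda: openfile.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest is the same as hashing the whole content in one call, so a helper like this could stand in for the md5( openfile.read() ) line without changing which files are reported.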
Extracting image EXIF data with Python
Most digital cameras and smartphones embed EXIF (Exchangeable Image File Format) data into the photographs they capture. This can include camera make and model, date and time, camera settings such as orientation, aperture, ISO, shutter speed and focal length, and even GPS location.
After a bit of experimentation I have found the following method, which uses the undocumented ExifTags module in the Python Imaging Library (PIL), to be the simplest way to extract EXIF tags from images using Python. There are other EXIF modules available for Python; however, PIL is currently the simplest to install on Mac OS X.
from PIL import Image
from PIL.ExifTags import TAGS

def get_exif_data(fname):
    """Get embedded EXIF data from image file."""
    ret = {}
    try:
        img = Image.open(fname)
        if hasattr( img, '_getexif' ):
            exifinfo = img._getexif()
            if exifinfo is not None:
                for tag, value in exifinfo.items():
                    # Map the numeric tag id to its name, falling back to the id.
                    decoded = TAGS.get(tag, tag)
                    ret[decoded] = value
    except IOError:
        print( 'IOERROR ' + fname )
    return ret
The above code was based on the code snippet in Paolo’s answer to this StackOverflow question. I have added basic exception handling and a check for the existence of the _getexif attribute prior to accessing it.
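The TAGS.get(tag, tag) lookup in the function above translates numeric EXIF tag ids into readable names and keeps the raw id when no name is known. The sketch below shows that fallback on its own. So that it runs without PIL it uses a small hand-written table (SAMPLE_TAGS, a hypothetical stand-in for PIL.ExifTags.TAGS), and the tag values are invented.

```python
# Hand-picked subset of standard EXIF tag ids, standing in for PIL.ExifTags.TAGS.
SAMPLE_TAGS = {271: 'Make', 272: 'Model', 306: 'DateTime'}

def decode_tags(raw_exif, tag_names=SAMPLE_TAGS):
    """Return a copy of raw_exif keyed by tag name where one is known."""
    return dict((tag_names.get(tag, tag), value)
                for tag, value in raw_exif.items())

# Invented values for illustration; 59999 is a made-up id with no known name.
RAW = {271: 'ExampleCam', 306: '2011:04:01 12:00:00', 59999: 'mystery'}
DECODED = decode_tags(RAW)
print(DECODED['Make'])   # looked up by name
print(DECODED[59999])    # unknown id is kept as the key
```

Keeping unknown ids as keys means no data is dropped when a camera writes a tag the table does not cover.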
Graphing real data with MatPlotLib
In a previous post I covered the basics of graphing in Python with the MatPlotLib module. In this post I am going to demonstrate how to use MatPlotLib with some real world data retrieved from a web service and then processed into a format usable by MatPlotLib.
The example script performs the following steps:
- Takes a specified stock’s ticker symbol and column to plot over time (from Open, High, Low, Close, Volume, Adj Close) as input.
- Fetches the corresponding stock data from Yahoo! Finance and saves it into a CSV file using the urllib module.
- Processes the data in the CSV file into a suitable format for matplotlib using the csv, datetime and matplotlib.dates modules.
- Plots a graph of the data over time using MatPlotLib and saves a copy as a PNG image.
Note: To keep the example concise I am not performing any error handling.
"""Fetches specified stock data from Yahoo and graph it with MatPlotLib."""
from urllib import urlretrieve
from csv import DictReader
from matplotlib import pyplot
from matplotlib.dates import date2num
from datetime import datetime

def fetchstockdata( stockticker, filename ):
    """Fetch specified stock data and store it in named file."""
    url = 'http://ichart.finance.yahoo.com/table.csv?s=%s' % stockticker
    urlretrieve( url, filename )

def importstockdata( filename ):
    """Import CSV data into dict of lists, converting dates into timestamps."""
    results = {}
    for row in DictReader( open( filename,'rb' ) ):
        for col in row.keys():
            if col == 'Date':
                coldata = date2num( datetime.strptime( row[col], '%Y-%m-%d') )
            else:
                # The other columns are numeric text, so plot them as floats.
                coldata = float( row[col] )
            results.setdefault( col, [] ).append( coldata )
    return results

def plotstockdata( stockdata, stockticker, dates, col ):
    """Use MatPlotLib to graph specified stock data."""
    pyplot.plot_date( stockdata[dates], stockdata[col], '-', xdate=True )
    pyplot.title( '%s - %s / %s' % (stockticker, col, dates) )
    pyplot.xlabel( dates )
    pyplot.ylabel( col )
    pyplot.savefig( '%s.png' % stockticker )
    pyplot.show()

if __name__ == '__main__':
    from sys import argv
    # Use first command line argument as ticker and second as column.
    TICKER = argv[1].upper()
    COL = argv[2]
    # Grab the stock data from Yahoo!
    FILENAME = '%s.csv' % TICKER
    fetchstockdata( TICKER, FILENAME )
    # Import the data.
    DATA = importstockdata( FILENAME )
    # Plot the graph with Date as X-Axis and user selected column as Y-Axis.
    plotstockdata( DATA, TICKER, 'Date', COL )
Running this script from the command line with “python StockChart.py goog 'Adj Close'” will produce a chart like the following.

This is a good example of why I like Python’s batteries included philosophy so much: it means I spend more of my time writing interesting bits of code as the utility functionality I need has already been implemented or is only an easy_install away.
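The CSV processing step is the part of the script most easily tried in isolation. The following is a rough sketch of the same dict-of-lists idea, written for Python 3 and fed from an in-memory string instead of a downloaded file. The function name import_columns and the sample rows are made up for illustration (they are not real quotes), and the date2num conversion is left out so the sketch runs without MatPlotLib installed.

```python
import csv
import io
from datetime import datetime

# Made-up rows in the same layout as the downloaded file (not real quotes).
SAMPLE_CSV = """Date,Open,Close
2011-04-01,10.0,10.5
2011-04-04,10.5,11.25
"""

def import_columns(csvfile):
    """Read CSV rows into a dict mapping each column name to a list of values."""
    results = {}
    for row in csv.DictReader(csvfile):
        for col, value in row.items():
            if col == 'Date':
                coldata = datetime.strptime(value, '%Y-%m-%d')
            else:
                coldata = float(value)
            results.setdefault(col, []).append(coldata)
    return results

COLUMNS = import_columns(io.StringIO(SAMPLE_CSV))
print(COLUMNS['Close'])
```

Storing the data column by column means any two columns can be handed straight to a plotting call as the X and Y series.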
Basic graphing with MatPlotLib
One of the Python modules that has most interested me recently is MatPlotLib, a sophisticated graphing module that can be used to create journal grade graphs of almost anything. The official gallery for MatPlotLib is worth checking out to get an idea of the sheer range of graph types it can be used to create.
It is simple to get started with MatPlotLib. For example, creating a line graph of x*x and saving it as a PNG file requires only the following:
"""Simple demonstration of MatPlotLib plotting."""
from matplotlib import pyplot

X = range(0,100)
Y = [ i*i for i in X ]
pyplot.plot( X, Y, '-' )
pyplot.title( 'Plotting x*x' )
pyplot.xlabel( 'X Axis' )
pyplot.ylabel( 'Y Axis' )
pyplot.savefig( 'Simple.png' )
pyplot.show()
The above script will produce the following graph:

To plot data over a time period, the simplest solution is to convert date/time values to timestamps using MatPlotLib’s date2num function and then plot them using the plot_date method as follows:
"""Simple demonstration of MatPlotLib Date plotting."""
from matplotlib import pyplot
from matplotlib.dates import date2num
from datetime import datetime, timedelta

# Generate 100 timestamps, one every 365 days, starting from today.
X = [date2num(datetime.today()+timedelta(days=365*x)) for x in range(0,100)]
Y = [i*i for i in range(0,100)]
pyplot.plot_date( X, Y, '-', xdate=True )
pyplot.title( 'Plotting x*x' )
pyplot.xlabel( 'X Axis' )
pyplot.ylabel( 'Y Axis' )
pyplot.savefig( 'SimpleDates.png' )
pyplot.show()
Which will generate a chart like the following:

As you can see it is fairly simple to graph data using MatPlotLib. This makes Python and MatPlotLib a compelling solution for data analysis when combined with the many available modules for dealing with common data storage formats like text (using RegEx), CSV, XML and JSON files and SQL databases.
Praise for Python
- Simplified memory management
I am so much more productive when I do not have to worry about pointer related errors (e.g. pointer math) or sweat the subtleties of memory management (e.g. memory alignment) while writing code.
- Less structural syntax
After using Python for a while I really appreciate its use of indentation to give a program structure, as it makes Python source code much more concise than C/C++.
- No compiling or linking
It is so much easier to stay in the flow when you’re not waiting 5-30 minutes for compilation and linking. I’ve recently taken to running PyLint when I miss the feedback from a compiler/linker on my program structure and to learn the coding style outlined in the Python Style Guide.
- Selective imports
Having worked on large scale C/C++ projects for most of my career I really appreciate the ability to only import what I want from modules and the option to also rename (or alias) what I’ve imported.
- Batteries included philosophy
The sheer scope of the library of modules included in Python means I can spend more time writing the interesting parts of my programs, as most of the time the utility functionality I need is just an import away.
- Package management
The Python Package Index (PyPI) and Setup Tools module make installing most Python modules as simple as ‘easy_install <module_name>’.
- Duck typing
Python’s use of Duck Typing emphasizes interfaces over types, which makes it so much easier to supply my own classes to standard library functions, as I only have to implement as much of the interface as is required.
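Two of the points above, selective imports and duck typing, are easy to show in a few lines. This is only an illustrative sketch written for Python 3: ChunkCollector is a hypothetical class invented for the example, relying on the fact that json.dump only requires its target to provide a write() method.

```python
import json
from os.path import join as joinpath   # import one name and give it an alias

class ChunkCollector(object):
    """Hypothetical file-like object that implements nothing but write()."""
    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)

COLLECTOR = ChunkCollector()
# json.dump never checks the type of its target; it just calls write().
json.dump({'a': 1}, COLLECTOR)
RESULT = ''.join(COLLECTOR.chunks)
print(RESULT)
print(joinpath('photos', '2011'))
```

No base class or interface declaration is needed; the object works because it happens to have the one method the library calls.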
Installing Python, MatPlotLib & iPython on Snow Leopard
As I detailed in a previous post, installing MatPlotLib on Mac OS used to be a fairly involved process that required using MacPorts to compile and build a complete Python stack. Thankfully it would seem things have become much simpler on Mac OS X 10.6.7 if you are installing Python 2.7.1, MatPlotLib 1.0.1 and iPython 0.10.1. Note: currently only the 32 bit version of Python will work consistently with MatPlotLib and iPython.
- First Python 2.7.1:
  - Download the prebuilt ‘Python 2.7.1 Mac OS X 32-bit i386/PPC Installer’ DMG from python.org.
  - Mount the DMG image and run the contained installer.
  - Verify it worked by opening a terminal and running the command ‘python -V’ which should return ‘Python 2.7.1’.
- Next MatPlotLib 1.0.1:
  - Download the prebuilt ‘matplotlib-1.0.1-python.org-32bit-py2.7-macosx10.3’ DMG from MatPlotLib’s SourceForge page.
  - Mount the DMG image and run the contained installer.
  - Verify this worked by opening a terminal, running python and then ‘import matplotlib’ followed by ‘print matplotlib.__version__’ which should return ‘1.0.1’.
- Finally iPython 0.10.1:
  - Download the iPython source ‘ipython-0.10.1.zip’ from the iPython download directory.
  - Extract the zip file.
  - Open a terminal window and cd into the newly extracted directory ‘ipython-0.10.1’.
  - Run the command ‘sudo python setup.py install’ and enter your password when prompted.
  - Verify this by running iPython with MatPlotLib via ‘ipython -pylab’ and then ‘x = randn(10000)’ followed by ‘hist(x, 100)’ and a chart window like the following image should pop up.
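The manual verification steps above could also be gathered into one small script. The sketch below is only a suggestion and not part of the original instructions: module_version is a made-up helper, and the code is written so that it should run on both Python 2.7 and Python 3. It reports the interpreter version, whether the interpreter is 64 bit, and the installed versions of the two modules.

```python
import importlib
import sys

def module_version(name):
    """Return a module's __version__, 'unknown' if it has none, None if missing."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, '__version__', 'unknown')

print('Python %s' % sys.version.split()[0])
# sys.maxsize exceeds 2**32 only on a 64 bit interpreter.
print('64 bit interpreter: %s' % (sys.maxsize > 2**32))
for name in ('matplotlib', 'IPython'):
    print('%s: %s' % (name, module_version(name) or 'not installed'))
```

For the setup described in this post the script would be expected to report a 32 bit interpreter along with the three version numbers listed at the top.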









