Python 2.7.1 Goodness
So far my favorite additions and changes in Python 2.7.1 since upgrading from the default Python 2.6.1 installation in Mac OS X Snow Leopard are the following:
- Dictionary and Set Comprehensions.
List comprehensions are one of my favorite language features in Python; they are incredibly useful for processing and building lists. So I am very excited to see dictionary and set comprehensions backported from Python 3 to Python 2.7.
- The argparse Module.
As a C/C++ programmer I originally did command line argument processing in Python manually using sys.argv, then I discovered the C-style getopt module. I always found myself wondering if there was a more concise, Pythonic way to handle command line parameters. The argparse module is the solution; it supersedes the older optparse module. I particularly like how argparse (like optparse) will generate the command line help for you!
- csv.DictWriter.writeheader method.
While this is a very minor change (introduced in Python 2.7, to be precise), I am a big fan of the csv module’s DictWriter class as a way to easily dump lists of dictionaries to a file for easy analysis and debugging with Excel. The addition of a new writeheader method to the DictWriter class makes it even easier to use.
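As a rough illustration of the comprehension syntax and the writeheader method mentioned above, here is a minimal sketch. The sample words, file names and sizes are invented for the example, and the StringIO fallback is only there so that the snippet should run on both Python 2.7 and Python 3.

```python
import csv

try:
    from StringIO import StringIO   # Python 2.7
except ImportError:
    from io import StringIO         # Python 3

# Hypothetical sample data, invented for illustration.
words = ['apple', 'bob', 'apple', 'kiwi']

# Dictionary comprehension: map each distinct word to its length.
lengths = {word: len(word) for word in words}

# Set comprehension: collect the distinct first letters.
initials = {word[0] for word in words}

# Hypothetical rows to dump; any list of dictionaries with the same keys works.
rows = [{'name': 'a.jpg', 'size': 10}, {'name': 'b.jpg', 'size': 20}]

buf = StringIO()
writer = csv.DictWriter(buf, fieldnames=['name', 'size'])
writer.writeheader()      # writes the 'name,size' header row
writer.writerows(rows)    # writes one CSV row per dictionary

csv_lines = buf.getvalue().splitlines()
```

Before writeheader existed, the header row had to be written by hand, for example by passing a dictionary that maps each field name to itself to writerow.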
You can find the full release notes for Python 2.7.1 here. There are many more changes than I’ve covered, so it’s well worth checking out the release notes. What are your favorite changes in Python 2.7.1?
Finding duplicate files using Python
I wrote this script to find and optionally delete duplicate files in a directory tree. It uses an MD5 hash of each file’s content to detect duplicates, and it is based on zalew’s answer on Stack Overflow. So far I have found this script sufficient for accurately finding and removing duplicate files in my photograph collection.
"""Find duplicate files inside a directory tree."""
from os import walk, remove, stat
from os.path import join as joinpath
# The standalone md5 module is deprecated; hashlib provides the same constructor.
from hashlib import md5

def find_duplicates( rootdir ):
    """Find duplicate files in directory tree."""
    filesizes = {}
    # Build up dict with key as filesize and value is list of filenames.
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            filepath = joinpath( path, filename )
            filesize = stat( filepath ).st_size
            filesizes.setdefault( filesize, [] ).append( filepath )
    unique = set()
    duplicates = []
    # We are only interested in lists with more than one entry.
    for files in [ flist for flist in filesizes.values() if len(flist)>1 ]:
        for filepath in files:
            # Open in binary mode so the hash covers the raw bytes of the file.
            with open( filepath, 'rb' ) as openfile:
                filehash = md5( openfile.read() ).hexdigest()
            if filehash not in unique:
                unique.add( filehash )
            else:
                duplicates.append( filepath )
    return duplicates

if __name__ == '__main__':
    from argparse import ArgumentParser
    PARSER = ArgumentParser( description='Finds duplicate files.' )
    PARSER.add_argument( 'root', metavar='R', help='Dir to search.' )
    PARSER.add_argument( '-remove', action='store_true',
                         help='Delete duplicate files.' )
    ARGS = PARSER.parse_args()
    DUPS = find_duplicates( ARGS.root )
    # print() with a single argument behaves the same on Python 2.7 and 3.
    print( '%d Duplicate files found.' % len(DUPS) )
    for f in sorted(DUPS):
        if ARGS.remove:
            remove( f )
            print( '\tDeleted '+ f )
        else:
            print( '\t'+ f )
I discovered the argparse module (added in Python 2.7) in the standard library this week and it makes command line parameter handling nice and concise.
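To show the generated help without running the script from a shell, here is a small sketch that builds the same two arguments and asks the parser for its help text. The build_parser helper, the 'dupes' program name and the 'photos' directory are made up for the example.

```python
from argparse import ArgumentParser

def build_parser():
    """Build a parser with the same arguments as the script above."""
    # 'dupes' is a hypothetical program name, chosen for illustration.
    parser = ArgumentParser( prog='dupes', description='Finds duplicate files.' )
    parser.add_argument( 'root', metavar='R', help='Dir to search.' )
    parser.add_argument( '-remove', action='store_true',
                         help='Delete duplicate files.' )
    return parser

parser = build_parser()

# parse_args() accepts an explicit list, which is handy for experimenting.
args = parser.parse_args( ['photos', '-remove'] )

# format_help() returns the text that the -h option would print.
help_text = parser.format_help()
```

Here args.root holds the directory name and args.remove is a boolean, and help_text includes the description and the help string of each argument.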
UPDATE: Changed the uniques list into a set and added a first pass using file sizes as a performance improvement; it is a lot faster now.
UPDATE: You can now find this script on github at github.com/dpbrown/Duplicate-Files.
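One possible further refinement, which is not part of the script above, is to hash each file in fixed-size chunks rather than reading it into memory in one go. This can matter for large photo or video files. The sketch below assumes a helper named file_md5 and a 64 KB chunk size; both are choices made for the example.

```python
import hashlib

def file_md5( filepath, chunk_size=65536 ):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open( filepath, 'rb' ) as openfile:
        # iter() with a sentinel keeps calling read() until it returns b''.
        for chunk in iter( lambda: openfile.read( chunk_size ), b'' ):
            digest.update( chunk )
    return digest.hexdigest()
```

The digest is the same as hashing the whole content at once, so a helper like this could replace the md5( openfile.read() ) call without changing which files are reported as duplicates.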
Garr – Lessons from Bamboo
I have been a fan of Garr Reynolds since I discovered his first book, Presentation Zen. This is Garr’s recent talk from TEDxTokyo, which is well worth watching.
These are the slides for the talk, which are done in Garr’s trademark style.
The ascendancy of JSON
I’ve long been in despair over the popularity of XML as an information interchange format. My main complaint is that it is so verbose that it is very easy to end up with the XML document structure taking up more memory than the actual data it encodes. This phenomenon is so common it even has a name, the ‘Angle Bracket Tax’, and it can be very painful on memory- or bandwidth-limited embedded systems.
JSON is based on a subset of the JavaScript language, and one of the big drivers of its adoption is that JSON is trivial to work with in JavaScript applications. Mainstream adoption is also taking place, with languages like Python and Ruby and frameworks like Microsoft’s .NET offering JSON support.
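As a small illustration of that support, the sketch below round-trips a record through Python’s standard json module and compares its encoded size with a hand-written XML version of the same data. The record and the XML layout are invented for the example, so the comparison only says something about this one tiny case.

```python
import json

# Hypothetical sensor reading, invented for illustration.
reading = {'sensor': 'temp1', 'value': 21.5, 'ok': True}

# Encode to a JSON string and decode it back into a dictionary.
encoded = json.dumps( reading, sort_keys=True )
decoded = json.loads( encoded )

# One possible hand-written XML encoding of the same record.
as_xml = ( '<reading><sensor>temp1</sensor><value>21.5</value>'
           '<ok>true</ok></reading>' )

# Extra characters the XML form needs for this particular record.
overhead = len( as_xml ) - len( encoded )
```

In the XML form every field name appears twice, once in the opening tag and once in the closing tag, which is where most of the extra characters come from.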
Karsten Januszewski has an interesting post on ‘The Rise of JSON’ that is well worth checking out.
Hacking Work Manifesto
This is an interesting video manifesto from the guys at Hacking Work: working around corporate processes and systems to achieve higher productivity.
What stands out the most to me in this video is the following statistic: ‘Workers receive 325 pages of information a day but only uses about 5 pages’. This rings true for me, as I’ve recently had to modify my email filtering scheme to mark certain categories of email as read automatically to prevent information overload.
Extracting image EXIF data with Python
Most digital cameras and smartphones embed EXIF (Exchangeable Image File Format) data into the photographs they capture. This can include: camera make & model, date and time, camera settings like orientation, aperture, ISO, shutter speed, focal length and even GPS location.
After a bit of experimentation I have found the following method of using the undocumented ExifTags module in the Python Imaging Library (PIL) to be the simplest way to extract EXIF tags from images using Python. There are other EXIF modules available for Python; however, PIL is currently the simplest to install on Mac OS X.
from PIL import Image
from PIL.ExifTags import TAGS

def get_exif_data(fname):
    """Get embedded EXIF data from image file."""
    ret = {}
    try:
        img = Image.open(fname)
        # _getexif is a private method that only some image formats provide.
        if hasattr( img, '_getexif' ):
            exifinfo = img._getexif()
            if exifinfo is not None:
                for tag, value in exifinfo.items():
                    # Map the numeric tag ID to its name when it is known.
                    decoded = TAGS.get(tag, tag)
                    ret[decoded] = value
    except IOError:
        print( 'IOERROR ' + fname )
    return ret
The above code was based on the code snippet in Paolo’s answer to this StackOverflow question. I have added basic exception handling and a check for the existence of the _getexif attribute prior to accessing it.
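Once the tags are in a dictionary, individual values still need some interpretation. As one example, EXIF stores timestamps as strings in the form 'YYYY:MM:DD HH:MM:SS', so the sketch below converts the DateTimeOriginal tag into a datetime object. The parse_exif_datetime helper and the sample dictionary are made up for illustration; the sample stands in for whatever get_exif_data returns for a real photograph.

```python
from datetime import datetime

def parse_exif_datetime( exif, tag='DateTimeOriginal' ):
    """Return the given EXIF date tag as a datetime, or None if unusable."""
    raw = exif.get( tag )
    if raw is None:
        return None
    try:
        # EXIF timestamps use colons in the date part as well as the time part.
        return datetime.strptime( str( raw ).strip(), '%Y:%m:%d %H:%M:%S' )
    except ValueError:
        return None

# Hypothetical dictionary in the shape returned by get_exif_data().
sample = {'Make': 'Canon', 'DateTimeOriginal': '2011:05:01 12:34:56'}
taken = parse_exif_datetime( sample )
```

Returning None for a missing or malformed value keeps the caller simple when working through a mixed collection where not every image carries a capture date.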








