Getting started with Python

The following is how I’d recommend getting started programming in Python:

  1. The Python Tutorial.
    First, work your way through the official Python tutorial. It’s very comprehensive, covers all the language features and includes a quick tour of the modules available in the standard library.
  2. Code Like a Pythonista: Idiomatic Python
    Next, I’d highly recommend reading the ‘Code Like a Pythonista’ article in its entirety. It’s very useful for learning the Pythonic way of thinking.
  3. The Python Style Guide.
    Then read the Python Style Guide (known as PEP 8), which will teach you the general Python coding style. Depending on the languages you’ve used before, it could be quite a different style.
  4. The Python Challenge.
    Now try the Python Challenge, which will push both your new Python skills and your riddle-solving abilities.  If you get stuck the official forums are helpful; I found I got stuck on the riddles more than the programming.  Once you’ve solved a challenge I’d strongly recommend checking out the solutions submitted for it.  I found this an incredibly helpful learning experience, as looking at the solutions taught me the Pythonic way to solve the problems.  Note: you can’t access the solutions to a challenge until you’ve solved it yourself.
  5. ‘Learn Python the Hard Way’ or ‘Dive into Python’.
    For gaining further knowledge there are several ebooks available online for free: the first is Learn Python the Hard Way and there is also the dated Dive into Python.  I’ve not read Learn Python the Hard Way but I’ve heard good reviews from several people.

For getting help with Python programming I’d recommend:

  • Stack Overflow.
    Stack Overflow is a collaborative question and answer site for programmers and has a very active Python community.  Before posting a question, search to see whether it has already been asked.
  • #python on
    Visiting the #python IRC channel on is also a very good way to get help with Python questions.  You can find out more about the various Python IRC channels here.  Note: You’ll need an IRC client like X-Chat (Linux & Windows) or Colloquy (Mac).

Here are some tools I’d recommend picking up:

  • Package installer – PIP or easy_install.
    PIP is the current Python package installer of choice and lets you easily download and install Python packages from various sources such as the official Python package repository (PyPI) and SourceForge.  I found that PIP makes installing new Python packages trivial 99% of the time; the other 1% of the time you’ll need to build the packages locally, which is more involved.  Note: Windows users may be better off sticking to the older easy_install tool instead of PIP.
  • Enhanced command line – iPython or bPython.
    iPython is an enhanced command line environment for Python that I’d highly recommend over the basic command line interpreter.  You can find several different video tutorials for iPython listed here.  I am told that bPython is another enhanced command line that is worth checking out too.
  • Code analyser – PyLint or pyflakes.
    PyLint is a Python counterpart to the Lint static code analysis tool for C/C++. It analyses your Python code and gives you useful feedback as well as a score out of 10.  PyLint also checks that your code adheres to the official Python Style Guide, which I found very useful for learning the Python coding style.  Alternatively, pyflakes has also been recommended for static analysis of Python code.

I’d be interested in hearing of any other resources you found useful to help you get started with python.

Python 2.7.1 Goodness

So far my favorite additions and changes in Python 2.7.1 since upgrading from the default Python 2.6.1 installation in Mac OS X Snow Leopard are the following:

  1. Dictionary and Set Comprehensions.
    List comprehensions are one of my favorite language features in Python; they are incredibly useful for processing and building lists.  So I am very excited to see dictionary and set comprehensions backported from Python 3 to Python 2.7.1.
  2. The ArgParse Module.
    As a C/C++ programmer I originally did command line argument processing in Python manually using sys.argv, then I discovered the C-style getopt module.  I always found myself wondering if there was a more concise, Pythonic way to handle command line parameters.  The argparse module is the solution; it replaces the older optparse module.  I particularly like how argparse (and optparse) will generate the command line help for you!
  3. csv.DictWriter.writeheader method.
    While this is a very minor change (in Python 2.7 to be precise), I am a big fan of the csv module’s DictWriter class as a way to easily dump lists of dictionaries to a file for easy analysis and debugging with Excel.  The addition of a new writeheader method to the DictWriter class makes it even easier to use.
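
To make the three items above concrete, here is a minimal sketch that touches each of them.  The word list, the argument names and the choice of sys.stdout as the CSV target are all made up for illustration; the code should run unchanged on Python 2.7 and Python 3.

```python
import sys
from argparse import ArgumentParser
from csv import DictWriter

# Made-up sample data for the demonstration.
WORDS = ['spam', 'eggs', 'spam', 'ham']

# Dictionary comprehension: map each distinct word to its length.
LENGTHS = {word: len(word) for word in WORDS}
# Set comprehension: collect the distinct first letters.
INITIALS = {word[0] for word in WORDS}

# argparse builds the -h/--help text from these declarations.
PARSER = ArgumentParser(description='Demo of some Python 2.7 additions.')
PARSER.add_argument('name', help='Who to greet.')
PARSER.add_argument('--shout', action='store_true', help='Use upper case.')
# Parse an explicit list so the demo does not depend on sys.argv.
ARGS = PARSER.parse_args(['world', '--shout'])

# writeheader() emits the field names as the first CSV row.
ROWS = [{'word': word, 'length': size}
        for word, size in sorted(LENGTHS.items())]
WRITER = DictWriter(sys.stdout, fieldnames=['word', 'length'])
WRITER.writeheader()
WRITER.writerows(ROWS)
```

Before 2.7 the header row had to be written by hand, for example by passing a dict that maps each field name to itself to writerow.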

You can find the full release notes for Python 2.7.1 here.  There are many more changes than I’ve covered, so it’s well worth checking out the release notes.  What are your favorite changes in Python 2.7.1?

Finding duplicate files using Python

I wrote this script to find and optionally delete duplicate files in a directory tree.  The script uses MD5 hashes of each file’s content to detect duplicate files.  It is based on zalew’s answer on Stack Overflow.  So far I have found it sufficient for accurately finding and removing duplicate files in my photograph collection.

"""Find duplicate files inside a directory tree."""

from os import walk, remove, stat
from os.path import join as joinpath
from hashlib import md5  # The standalone md5 module is deprecated.

def find_duplicates( rootdir ):
    """Find duplicate files in directory tree."""
    filesizes = {}
    # Build up dict with key as filesize and value is list of filenames.
    for path, dirs, files in walk( rootdir ):
        for filename in files:
            filepath = joinpath( path, filename )
            filesize = stat( filepath ).st_size
            filesizes.setdefault( filesize, [] ).append( filepath )
    unique = set()
    duplicates = []
    # We are only interested in lists with more than one entry.
    for files in [ flist for flist in filesizes.values() if len(flist)>1 ]:
        for filepath in files:
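            # Hash the raw bytes, so open the file in binary mode.
            with open( filepath, 'rb' ) as openfile:
                filehash = md5( openfile.read() ).hexdigest()
            if filehash not in unique:
                # First file seen with this content: remember its hash.
                unique.add( filehash )
            else:
                # Content already seen, so this file is a duplicate.
                duplicates.append( filepath )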
    return duplicates

if __name__ == '__main__':
    from argparse import ArgumentParser

    PARSER = ArgumentParser( description='Finds duplicate files.' )
    PARSER.add_argument( 'root', metavar='R', help='Dir to search.' )
    PARSER.add_argument( '-remove', action='store_true',
                         help='Delete duplicate files.' )
    ARGS = PARSER.parse_args()

    DUPS = find_duplicates( ARGS.root )

    print '%d Duplicate files found.' % len(DUPS)
    for f in sorted(DUPS):
        if ARGS.remove:
            remove( f )
            print '\tDeleted '+ f
        else:
            print '\t'+ f

I discovered the argparse module (added in Python 2.7) in the standard library this week and it makes command line parameter handling nice and concise.

UPDATE: Changed the uniques list into a set and added a first pass that groups files by size as a performance improvement; it is a lot faster now.
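
For very large files, one possible further refinement (not part of the original script) is to hash each file in fixed-size chunks rather than loading it whole.  The helper name md5_of_file and the 64 KB chunk size below are my own choices for this sketch.

```python
from hashlib import md5

def md5_of_file(filepath, chunksize=64 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = md5()
    with open(filepath, 'rb') as openfile:
        # iter() with a sentinel keeps calling read() until it returns
        # an empty byte string, which signals end of file.
        for chunk in iter(lambda: openfile.read(chunksize), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest is the same as hashing the whole content in one call; only the peak memory use differs.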

UPDATE: You can now find this script on github at

The ascendancy of JSON

I’ve long been in despair over the popularity of XML as an information interchange format.  My main complaint is that it is so verbose that it is very easy to end up with the XML document structure taking up more memory than the actual data it encodes.  This phenomenon is so common it even has a name, the ‘Angle Bracket Tax’, and it can be very painful on memory- or bandwidth-limited embedded systems.

JSON is based on a subset of the JavaScript scripting language, and one of the big drivers of its adoption is that JSON is trivial to work with in JavaScript applications.  Mainstream adoption is taking place, with languages like Python and Ruby and frameworks like Microsoft’s .NET offering JSON support.
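
As a rough illustration of the verbosity difference, the sketch below encodes the same made-up sensor readings with Python’s json module and as one plausible XML layout built with xml.etree.ElementTree.  The data and the element names are assumptions for the example, and other XML layouts would give different sizes.

```python
import json
import xml.etree.ElementTree as ET

# Made-up sample data for the comparison.
READINGS = [{'sensor': 'temp', 'value': 21.5},
            {'sensor': 'humidity', 'value': 40}]

# JSON: one call to encode, one call to decode.
JSON_TEXT = json.dumps(READINGS)
DECODED = json.loads(JSON_TEXT)

# XML: one element per record and one child element per field.
ROOT = ET.Element('readings')
for reading in READINGS:
    item = ET.SubElement(ROOT, 'reading')
    for key, val in reading.items():
        ET.SubElement(item, key).text = str(val)
XML_TEXT = ET.tostring(ROOT)

print('JSON: %d bytes, XML: %d bytes' % (len(JSON_TEXT), len(XML_TEXT)))
```

With this layout every field name appears twice in the XML (opening and closing tag) but only once in the JSON, which is where most of the extra size comes from.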

Karsten Januszewski has an interesting post on ‘The Rise of JSON‘ that is well worth checking out.

Extracting image EXIF data with Python

Most digital cameras and smartphones embed EXIF (Exchangeable Image File Format) data into the photographs they capture.  This can include camera make and model, date and time, and camera settings like orientation, aperture, ISO, shutter speed and focal length, and even GPS location.

After a bit of experimentation I have found the following method, which uses the undocumented ExifTags module in the Python Imaging Library (PIL), to be the simplest way to extract EXIF tags from images using Python.  There are other EXIF modules available for Python; however, PIL is currently the simplest to install on Mac OS X.

from PIL import Image
from PIL.ExifTags import TAGS

def get_exif_data(fname):
    """Get embedded EXIF data from image file."""
    ret = {}
    try:
        img = Image.open(fname)
        # Not every image format exposes the private _getexif() helper.
        if hasattr( img, '_getexif' ):
            exifinfo = img._getexif()
            if exifinfo is not None:
                for tag, value in exifinfo.items():
                    decoded = TAGS.get(tag, tag)
                    ret[decoded] = value
    except IOError:
        print 'IOERROR ' + fname
    return ret

The above code was based on the code snippet in Paolo’s answer to this StackOverflow question. I have added basic exception handling and a check for the existence of the _getexif attribute prior to accessing it.
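
As a small follow-on, here is a sketch of how the returned dictionary might be used.  EXIF stores dates as text in the form ‘YYYY:MM:DD HH:MM:SS’, so they need parsing before they can be compared or sorted.  The helper name exif_timestamp and the sample values are hypothetical; the sample dict stands in for the result of get_exif_data() so the snippet runs without PIL or an image file.

```python
from datetime import datetime

# Hypothetical sample shaped like the dict returned by get_exif_data().
SAMPLE_EXIF = {'Make': 'ExampleCam',
               'DateTimeOriginal': '2011:02:13 15:04:05'}

def exif_timestamp(exif, tag='DateTimeOriginal'):
    """Return the named EXIF date tag as a datetime, or None if unusable."""
    raw = exif.get(tag)
    if raw is None:
        return None
    try:
        # EXIF dates use colons in the date part as well as the time part.
        return datetime.strptime(raw, '%Y:%m:%d %H:%M:%S')
    except (TypeError, ValueError):
        # Missing, malformed or non-text values are treated as absent.
        return None

print(exif_timestamp(SAMPLE_EXIF))
```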

Graphing real data with MatPlotLib

In a previous post I covered the basics of graphing in Python with the MatPlotLib module.  In this post I am going to demonstrate how to use MatPlotLib with some real world data retrieved from a web service and then processed into a format usable by MatPlotLib.

The example script performs the following steps:

  1. Takes a specified stock’s ticker symbol and column to plot over time (from Open, High, Low, Close, Volume, Adj Close) as input.
  2. Fetches the corresponding stock data from Yahoo! Finance and saves it into a CSV file using the urllib module.
  3. Processes the data in the CSV file into a suitable format for matplotlib using the csv, datetime and matplotlib.dates modules.
  4. Plots a graph of the data over time using MatPlotLib and saves a copy as a PNG image.

Note: To keep the example concise I am not performing any error handling.

"""Fetches specified stock data from Yahoo and graph it with MatPlotLib."""

from urllib import urlretrieve
from csv import DictReader
from matplotlib import pyplot
from matplotlib.dates import date2num
from datetime import datetime

def fetchstockdata( stockticker, filename ):
    """Fetch specified stock data and store it in named file."""
    url = '' % stockticker
    urlretrieve( url, filename )

def importstockdata( filename ):
    """Import CSV data into dict of lists, converting dates into timestamps."""
    results = {}
    for row in DictReader( open( filename,'rb' ) ):
        for col in row.keys():
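            if col == 'Date':
                # Convert the date string into a MatPlotLib date number.
                coldata = date2num( datetime.strptime( row[col], '%Y-%m-%d') )
            else:
                # Leave all other columns as read from the CSV file.
                coldata = row[col]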
            results.setdefault( col, [] ).append( coldata )
    return results

def plotstockdata( stockdata, stockticker, dates, col ):
    """Use MatPlotLib to graph specified stock data."""
    pyplot.plot_date( stockdata[dates], stockdata[col], '-', xdate=True )
    pyplot.title( '%s - %s / %s' % (stockticker, col, dates) )
    pyplot.xlabel( dates )
    pyplot.ylabel( col )
    pyplot.savefig( '%s.png' % stockticker )

if __name__ == '__main__':
    from sys import argv
    # Use second argument as ticker and third argument as column.
    TICKER = argv[1].upper()
    COL = argv[2]
    # Grab the stock data from Yahoo!
    FILENAME = '%s.csv' % TICKER
    fetchstockdata( TICKER, FILENAME )
    # Import the data.
    DATA = importstockdata( FILENAME )
    # Plot the graph with Date as X-Axis and User selected column as Y-Axis.
    plotstockdata( DATA, TICKER, 'Date', COL )

Running this script using the command line “python goog ‘Adj Close’” will produce a chart like the following.

This is a good example of why I like Python’s batteries included philosophy so much: it means I spend more of my time writing interesting bits of code as the utility functionality I need has already been implemented or is only an easy_install away.
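
The CSV import step (step 3 in the list above) can also be tried on its own, without a network connection or MatPlotLib.  The sketch below is a variation on importstockdata rather than a copy of it: it reads from a made-up inline sample instead of a downloaded file, keeps dates as datetime objects instead of calling date2num, and converts the other columns to floats.  The name columns_from_csv and the figures are invented for the example.

```python
from csv import DictReader
from datetime import datetime

# Made-up figures in the same column layout as the list above.
SAMPLE_CSV = """Date,Open,Close,Volume
2011-01-03,10.0,10.5,1000
2011-01-04,10.5,10.2,1500
"""

def columns_from_csv(lines):
    """Turn CSV rows into a dict mapping column name to list of values."""
    columns = {}
    for row in DictReader(lines):
        for col, raw in row.items():
            if col == 'Date':
                value = datetime.strptime(raw, '%Y-%m-%d')
            else:
                value = float(raw)
            columns.setdefault(col, []).append(value)
    return columns

# DictReader accepts any iterable of lines, not just an open file.
COLUMNS = columns_from_csv(SAMPLE_CSV.splitlines())
print(COLUMNS['Close'])
```

Each list in the result lines up row by row with the ‘Date’ list, which is the shape the plotting function expects for its x and y values.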