Distributed Scrum

Software Engineering Radio’s recent podcast about Distributed Scrum was pretty interesting.

The most memorable comment from the whole discussion was that a remote/external team is one that is more than thirty metres from your team.  By that definition many scrum teams that would normally be considered internal teams are actually external teams, e.g. two teams could be in the same building but still be external to each other.

Another interesting idea was that instead of having each remote group work on separate tasks or features, development of each feature should be split between the different teams.  The idea is that this forces communication to occur much more frequently than the traditional approach of having each team work on its own siloed task.

Organising photographs with Python

Previously I posted about extracting EXIF information from images using the Python Image Library (PIL).  The reason I was investigating how to do this was that I wanted to programmatically reorganise my personal photograph collection from its current ad-hoc mess into something more structured.

My goal was to use Python to extract the EXIF information from each image file and use the creation time of each image as the key to organise the images into a Year/Month/Day directory structure.  If an image file is missing EXIF data then the file’s timestamp can be used instead via the -filetime option.

An example of running this script to reorganise the photos folder while leaving the original files in place would be:

python PhotoShuffle.py -copy /Daniel/Pictures /Daniel/OrganisedPictures

You can also find the latest version on GitHub at github.com/endlesslycurious/PhotoShuffle; the following is the current script:

"""Scans a folder and builds a date sorted tree based on image creation time."""

if __name__ == '__main__':
    from os import makedirs, listdir, rmdir
    from os.path import join as joinpath, exists, getmtime
    from datetime import datetime
    from shutil import move, copy2 as copy
    from ExifScan import scan_exif_data
    from argparse import ArgumentParser

    PARSER = ArgumentParser(description='Builds a date sorted tree of images.')
    PARSER.add_argument( 'orig', metavar='O', help='Source root directory.')
    PARSER.add_argument( 'dest', metavar='D',
                         help='Destination root directory' )
    PARSER.add_argument( '-filetime', action='store_true',
                         help='Use file time if missing EXIF' )
    PARSER.add_argument( '-copy', action='store_true',
                         help='Copy files instead of moving.' )
    ARGS = PARSER.parse_args()

    print 'Gathering & processing EXIF data.'

    # Get creation time from EXIF data.
    DATA = scan_exif_data( ARGS.orig )

    # Process EXIF data.
    for r in DATA:
        info = r['exif']
        # precedence is DateTimeOriginal > DateTime.
        if 'DateTimeOriginal' in info.keys():
            r['ftime'] = info['DateTimeOriginal']
        elif 'DateTime' in info.keys():
            r['ftime'] = info['DateTime']
        if 'ftime' in r.keys():
            r['ftime'] = datetime.strptime(r['ftime'],'%Y:%m:%d %H:%M:%S')
        elif ARGS.filetime:
            ctime = getmtime( joinpath( r['path'], r['name'] + r['ext'] ))
            r['ftime'] = datetime.fromtimestamp( ctime )

    # Remove any files without datetime info.
    DATA = [ f for f in DATA if 'ftime' in f.keys() ]

    # Generate new path YYYY/MM/DD/ using EXIF date.
    for r in DATA:
        r['newpath'] = joinpath( ARGS.dest, r['ftime'].strftime('%Y/%m/%d') )

    # Generate filenames per directory: 1 to n (zero padded) followed by DDMonYYYY.
    print 'Generating filenames.'
    for newdir in set( [ i['newpath'] for i in DATA ] ):
        files = [ r for r in DATA if r['newpath'] == newdir ]
        pad = len( str( len(files) ) )
        usednames = []
        for i in range( len(files) ):
            datestr = files[i]['ftime'].strftime('%d%b%Y')
            newname = '%0*d_%s' % (pad, i+1, datestr)
            j = i+1
            # if filename exists keep looking until it doesn't. Ugly!
            while ( exists( joinpath( newdir, newname + files[i]['ext'] ) ) or
                newname in usednames ):
                j += 1
                jpad = max( pad, len( str( j ) ) )
                newname = '%0*d_%s' % (jpad, j, datestr)
            usednames.append( newname )
            files[i]['newname'] = newname

    # Copy or move the files to their new locations, creating directories as required.
    print 'Copying files.'
    for r in DATA:
        origfile = joinpath( r['path'], r['name'] + r['ext'] )
        newfile = joinpath( r['newpath'], r['newname'] + r['ext'] )
        if not exists( r['newpath'] ):
            makedirs( r['newpath'] )
        if not exists( newfile ):
            if ARGS.copy:
                print 'Copying '+ origfile +' to '+ newfile
                copy( origfile, newfile )
            else:
                print 'Moving '+ origfile +' to '+ newfile
                move( origfile, newfile )
        else:
            print newfile +' already exists!'

    # Moving files leaves empty source directories behind, so clean them up.
    if not ARGS.copy:
        print 'Removing empty directories'
        DIRS = set( [ d['path'] for d in DATA ] )
        for d in DIRS:
            # if the directory is empty then delete it.
            if len( listdir( d ) ) == 0:
                print 'Deleting dir ' + d
                rmdir( d )
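
The scan_exif_data helper imported above comes from the ExifScan module covered in my earlier post and isn’t reproduced here.  The following is a rough sketch of what such a helper might look like using PIL; the interface (a list of dicts with 'path', 'name', 'ext' and 'exif' keys) is inferred from how the script above uses it, so treat the details as assumptions rather than the actual module:

"""Rough sketch of a scan_exif_data style helper (assumed interface, not the original ExifScan module)."""

from os import walk
from os.path import join as joinpath, splitext
from PIL import Image
from PIL.ExifTags import TAGS

def scan_exif_data( root ):
    """Walk root and return a list of dicts with 'path', 'name', 'ext' and 'exif' keys."""
    records = []
    for path, dirs, files in walk( root ):
        for filename in files:
            name, ext = splitext( filename )
            if ext.lower() not in ( '.jpg', '.jpeg' ):
                continue
            exif = {}
            try:
                raw = Image.open( joinpath( path, filename ) )._getexif()
                if raw:
                    # Map numeric EXIF tag ids to readable names e.g. 'DateTimeOriginal'.
                    exif = dict( (TAGS.get(tag, tag), value) for tag, value in raw.items() )
            except IOError:
                pass
            records.append( { 'path': path, 'name': name, 'ext': ext, 'exif': exif } )
    return records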

UPDATE: I tend to run my duplicate file script over image collections before I organise them to remove any duplicates. You can find that script on github at github.com/endlesslycurious/Duplicate-Files.
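
That duplicate file script isn’t reproduced here, but the general idea is to hash each file’s contents and flag files that share a digest.  The following is a minimal sketch of that approach (not the actual script from the repository):

"""Minimal sketch of duplicate detection by content hash (not the actual Duplicate-Files script)."""

from os import walk
from os.path import join as joinpath
from hashlib import md5

def find_duplicates( root ):
    """Return a dict mapping an MD5 digest to the files that share it."""
    seen = {}
    for path, dirs, files in walk( root ):
        for filename in files:
            fullpath = joinpath( path, filename )
            digest = md5( open( fullpath, 'rb' ).read() ).hexdigest()
            seen.setdefault( digest, [] ).append( fullpath )
    # Only digests shared by more than one file are duplicates.
    return dict( (k, v) for k, v in seen.items() if len( v ) > 1 )

if __name__ == '__main__':
    for digest, paths in find_duplicates( '/Daniel/Pictures' ).items():
        print digest, paths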

Downloading Wallpaper Images from Reddit with Python

In my previous post I demonstrated how to query Reddit using Python and JSON. My goal was a script to download the latest and greatest wallpapers from image sub-reddits like wallpaper, to keep my desktop wallpaper fresh and interesting. The main function of the script is to download any JPEG formatted image listed in the specified sub-reddit into a folder.

A lot of the script turned out to be managing URLs, handling exceptions and checking image types so that links to the most commonly encountered image host, imgur, worked. I opted to use the reddit hash id of each post as the filename for the downloaded JPEGs, as this seems to be a unique value, which means there are no collisions and it’s easy to programmatically check whether an item’s image has already been downloaded. Using a hash value instead of the item’s text title doesn’t make for the most memorable filenames, though.

The single most frustrating thing I encountered when writing this script is that I have yet to discover a programmatic way to work out the URL for an image on Flickr given a Flickr page URL. This is a real shame, as Flickr is a really popular image hosting site with a lot of great images.

An example of running the script to download images with a score greater than 50 from the wallpaper sub-reddit into a folder called wallpaper would be as follows:

python redditdownload.py wallpaper wallpaper -s 50 

And to run the same query but only get any new images you don’t already have, run the following:

python redditdownload.py wallpaper wallpaper -s 50 -update

You can find the source code for this post (and the previous one) on GitHub at github.com/endlesslycurious/RedditImageGrab, and the current source for the script is as follows:

"""Download images from a reddit.com subreddit."""

from urllib2 import urlopen, HTTPError, URLError 
from httplib import InvalidURL
from argparse import ArgumentParser
from os.path import exists as pathexists, join as pathjoin
from os import mkdir
from reddit import getitems

if __name__ == "__main__": 
    PARSER = ArgumentParser( description='Downloads files with specified extension from the specified subreddit.')
    PARSER.add_argument( 'reddit', metavar='r', help='Subreddit name.')
    PARSER.add_argument( 'dir', metavar='d', help='Dir to put downloaded files in.')
    PARSER.add_argument( '-last', metavar='l', default='', required=False, help='ID of the last downloaded file.')
    PARSER.add_argument( '-score', metavar='s', default='0', type=int, required=False, help='Minimum score of images to download.')
    PARSER.add_argument( '-num', metavar='n', default='0', type=int, required=False, help='Number of images to process.')
    PARSER.add_argument( '-update', default=False, action='store_true', required=False, help='Run until you encounter a file already downloaded.')
    ARGS = PARSER.parse_args()
 
    print 'Downloading images from "%s" subreddit' % (ARGS.reddit)

    ITEMS = getitems( ARGS.reddit, ARGS.last )
    N = D = E = S = F = 0
    FINISHED = False

    # Create the specified directory if it doesn't already exist.
    if not pathexists( ARGS.dir ):
        mkdir( ARGS.dir )

    while ITEMS and not FINISHED:
        LAST = ''
        for ITEM in ITEMS:
            if ITEM['score'] < ARGS.score:
                print '\tSCORE: %s has score of %s which is lower than required score of %s.' % (ITEM['id'],ITEM['score'],ARGS.score) 
                S += 1
            else:
                FILENAME = pathjoin( ARGS.dir, '%s.jpg' % (ITEM['id'] ) )
                # Don't download files multiple times!
                if not pathexists( FILENAME ):
                    try:
                        if 'imgur.com' in ITEM['url']:
                            # Change .png to .jpg for imgur urls. 
                            if ITEM['url'].endswith('.png'):
                                ITEM['url'] = ITEM['url'].replace('.png','.jpg')
                            # Add .jpg to imgur urls that are missing an image extension.
                            elif '.jpg' not in ITEM['url'] and '.jpeg' not in ITEM['url']:
                                ITEM['url'] = '%s.jpg' % ITEM['url']

                        RESPONSE = urlopen( ITEM['url'] )
                        INFO = RESPONSE.info()
                        
                        # Work out file type either from the response or the url.
                        if 'content-type' in INFO.keys():
                            FILETYPE = INFO['content-type']
                        elif ITEM['url'].endswith( 'jpg' ):
                            FILETYPE = 'image/jpeg'
                        elif ITEM['url'].endswith( 'jpeg' ):
                            FILETYPE = 'image/jpeg'
                        else:
                            FILETYPE = 'unknown'
                             
                        # Only try to download jpeg images.
                        if FILETYPE == 'image/jpeg':
                            FILEDATA = RESPONSE.read()
                            FILE = open( FILENAME, 'wb')
                            FILE.write(FILEDATA)
                            FILE.close()
                            print '\tDownloaded %s to %s.' % (ITEM['url'],FILENAME)
                            D += 1
                        else:
                            print '\tWRONG FILE TYPE: %s has type: %s!' % (ITEM['url'],FILETYPE)
                            S += 1
                    except HTTPError as ERROR:
                            print '\tHTTP ERROR: Code %s for %s.' % (ERROR.code,ITEM['url'])
                            F += 1
                    except URLError as ERROR:
                            print '\tURL ERROR: %s!' % ITEM['url']
                            F += 1
                    except InvalidURL as ERROR:
                            print '\tInvalid URL: %s!' % ITEM['url']
                            F += 1
                else:
                    print '\tALREADY EXISTS: %s for %s already exists.' % (FILENAME,ITEM['url'])
                    E += 1
                    if ARGS.update:
                        print '\tUpdate complete, exiting.'
                        FINISHED = True
                        break
            LAST = ITEM['id']
            N += 1
            if ARGS.num > 0 and N >= ARGS.num:
                print '\t%d images attempted, exiting.' % N
                FINISHED = True
                break
        ITEMS = getitems( ARGS.reddit, LAST )

    print 'Downloaded %d of %d (Skipped %d, Exists %d, Failed %d)' % (D, N, S, E, F)

Querying Reddit with Python

I’ve long been a fan of reddit, a social news site where users can submit news and comment and vote on the submissions of other users.  Reddit provides a form of content filtering through subreddits, which are specialised by topic, e.g. the Python programming language.

I thought it would be fun to figure out how to get the most recent items for a particular subreddit, and the older items that come after a given item in the listing.  Both turned out to be really simple using existing Python packages to query reddit and process the JSON formatted response.

"""Return list of items from a sub-reddit of reddit.com."""

from urllib2 import urlopen, HTTPError 
from json import JSONDecoder

def getitems( subreddit, previd=''):
    """Return list of items from a subreddit."""
    url = 'http://www.reddit.com/r/%s.json' % subreddit
    # Get items after item with 'id' of previd.
    if previd != '':
        url = '%s?after=t3_%s' % (url, previd)
    try:
        json = urlopen( url ).read()
        data = JSONDecoder().decode( json )
        items = [ x['data'] for x in data['data']['children'] ]
    except HTTPError as ERROR:
        print '\tHTTP ERROR: Code %s for %s.' % (ERROR.code, url)
        items = []
    return items

if __name__ == "__main__":

    print 'Recent items for Python.'
    ITEMS = getitems( 'python' )
    for ITEM in ITEMS:
        print '\t%s - %s' % (ITEM['title'], ITEM['url'])

    print 'Previous items for Python.'
    OLDITEMS = getitems( 'python', ITEMS[-1]['id'] )
    for ITEM in OLDITEMS:
        print '\t%s - %s' % (ITEM['title'], ITEM['url'])

In my next post I’ll detail what I used this script for.

The Rands Test

Rands has posted his own ‘Rands Test’ in the style of Joel Spolsky’s famous ‘Joel Test’ for telling whether your company is screwed or not.  Rands’ test focuses on communication, while Joel’s original test focused on engineering:

“There is a higher order goal at the intersection of the two questions The Rands Test intends to answer: Where am I? and What the hell is going on? While understanding the answers to these questions will give you a good idea about the communication health of your company, the higher order goal is selfish.”

Getting started with Python

The following is how I’d recommend getting started programming in Python:

  1. The Python Tutorial.
    First off, work your way through the official Python tutorial; it’s very comprehensive, covering all the language features, and it also has a quick tour of the modules available in the standard library.
  2. Code Like a Pythonista: Idiomatic Python
    Next I’d highly recommend reading the ‘Code Like a Pythonista’ article in its entirety; it’s very useful for learning about the Pythonic way of thinking (see the short example after this list).
  3. The Python Style Guide.
    Next read the Python Style Guide (known as PEP-8), which will teach you the general Python coding style; depending on the languages you’ve used before this could be quite a different style.
  4. The Python Challenge.
    Now try the Python Challenge, which will push your new Python skills and riddle-solving abilities.  If you get stuck the official forums are helpful; I found I got stuck on the riddles more than the programming.  Once you’ve solved each challenge I’d strongly recommend going and checking out the submitted solutions to it.  I found this an incredibly helpful learning experience, as by looking at the solutions I learned the Pythonic way to solve the problems.  Note: you can’t access these solutions until you’ve solved them yourself.
  5. ‘Learn Python the Hard Way’ or ‘Dive into Python’.
    For gaining further knowledge there are several ebooks available online for free: the first is Learn Python the Hard Way and there is also the somewhat dated Dive into Python.  I’ve not read Learn Python the Hard Way but I’ve heard good reviews from several people.
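
To give a flavour of what ‘Pythonic’ means in practice, here is a small made-up example (not taken from any of the resources above) comparing a C-style loop with the list comprehension those guides push you towards:

numbers = [ 1, 2, 3, 4, 5 ]

# C-style: index into the list manually.
squares = []
for i in range( len( numbers ) ):
    squares.append( numbers[i] * numbers[i] )

# Pythonic: iterate directly using a list comprehension.
squares = [ n * n for n in numbers ]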

For getting help with Python programming I’d recommend:

  • Stack Overflow.
    Stack Overflow is a collaborative question and answer site for programmers and has a very active Python community.  It’s highly recommended to search and see whether your question has already been asked before posting a new question.
  • #python on irc.freenode.net.
    Visiting the #python IRC channel on irc.freenode.net is also a very good way to get help with Python questions.  You can find out more about the various Python IRC channels here.  Note: you’ll need an IRC client like X-Chat (Linux & Windows) or Colloquy (Mac).

Here are some tools I’d recommending picking up:

  • Package installer – PIP or easy_install.
    PIP is the current Python package installer of choice and lets you easily download and install Python packages from various sources such as the official Python package repository – PyPI – and SourceForge.  I found that PIP makes installing new Python packages trivial 99% of the time; the other 1% of the time you’ll need to build the packages locally, which is more involved (see the example commands after this list).  Note: Windows users may be better off sticking to the older easy_install tool instead of PIP.
  • Enhanced command line – iPython or bPython.
    iPython is an enhanced command line environment for Python that I’d highly recommend over the basic command line interpreter.  You can find several video tutorials for iPython listed here.  I’m told that bPython is another enhanced command line that is worth checking out too.
  • Code analyser – PyLint or pyflakes.
    PyLint is a Python version of the Lint C/C++ static code analysis tool; it will analyse your Python code, give you useful feedback and assign a score out of 10.  PyLint will also check that your code adheres to the official Python Style Guide, which I found very useful for learning the Python coding style.  Alternatively, pyflakes has also been recommended for static analysis of Python code.
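
For example, installing iPython with PIP, launching it, and then running PyLint over the photo script from earlier would look roughly like this (the package and file names are just illustrative):

pip install ipython

ipython

pylint PhotoShuffle.py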

I’d be interested in hearing of any other resources you found useful to help you get started with python.