Goodbye Vancouver. Hej Stockholm!
At the start of February I moved from Vancouver, Canada to Stockholm, Sweden!
Last year was a traumatic year for me, with the death of my beloved wife Barbara from brain cancer in July. As five of the six years we were married were spent in Vancouver, there are many happy memories there, but there are unpleasant memories too.
I felt it was time for a new adventure, and the chance to work with the awesome folk at DICE on the Frostbite engine was an opportunity I felt I’d regret not taking. Plus I’d always wanted to learn another language, as the idea of thinking in multiple languages has always intrigued me.
Leaving behind the friends I’ve made and the colleagues I’ve worked with in the five years I spent in Vancouver has been hard. Yet exploring Stockholm and meeting new people has been interesting so far!
Organising photographs with Python
Previously I posted about extracting EXIF information from images using the Python Imaging Library (PIL). I was investigating this because I wanted to programmatically reorganise my personal photograph collection from its current ad-hoc mess into something more structured.
My goal was to use Python to extract the EXIF information from each image file and use the creation time of each image as the key to organise the images into the directory structure Year/Month/Day. If an image file is missing EXIF data then the file’s modification time can be used instead via the -filetime option.
An example of running PhotoShuffle.py to reorganise the photos folder and leave the original files in place would be:
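The core of that idea, turning an EXIF timestamp into a Year/Month/Day directory, can be sketched in a few lines. This is a minimal illustration written for Python 3 rather than an excerpt from the script; the function name dated_directory and the sample timestamp are made up for the example.

```python
from datetime import datetime
from os.path import join

# EXIF stores timestamps as 'YYYY:MM:DD HH:MM:SS'.
EXIF_FORMAT = '%Y:%m:%d %H:%M:%S'


def dated_directory(dest_root, exif_timestamp):
    """Return dest_root/YYYY/MM/DD for an EXIF timestamp string.

    dated_directory is a hypothetical helper, not part of PhotoShuffle.
    """
    taken = datetime.strptime(exif_timestamp, EXIF_FORMAT)
    return join(dest_root,
                taken.strftime('%Y'),
                taken.strftime('%m'),
                taken.strftime('%d'))


# The timestamp below is invented sample data.
print(dated_directory('OrganisedPictures', '2011:07:04 18:30:05'))
```

Building the path with join, one component at a time, keeps the separator correct on whichever platform the sketch runs on.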
python PhotoShuffle.py -copy /Daniel/Pictures /Daniel/OrganisedPictures
You can also find the latest version on GitHub at github.com/dpbrown/PhotoShuffle. The following is the current script:
"""Scans a folder and builds a date sorted tree based on image creation time."""
if __name__ == '__main__':
    from os import makedirs, listdir, rmdir
    from os.path import join as joinpath, exists, getmtime
    from datetime import datetime
    from shutil import move, copy2 as copy
    from ExifScan import scan_exif_data
    from argparse import ArgumentParser
    PARSER = ArgumentParser(description='Builds a date sorted tree of images.')
    PARSER.add_argument( 'orig', metavar='O', help='Source root directory.')
    PARSER.add_argument( 'dest', metavar='D',
                         help='Destination root directory' )
    PARSER.add_argument( '-filetime', action='store_true',
                         help='Use file time if missing EXIF' )
    PARSER.add_argument( '-copy', action='store_true',
                         help='Copy files instead of moving.' )
    ARGS = PARSER.parse_args()
    print 'Gathering & processing EXIF data.'
    # Get creation time from EXIF data.
    DATA = scan_exif_data( ARGS.orig )
    # Process EXIF data.
    for r in DATA:
        info = r['exif']
        # Precedence is DateTimeOriginal > DateTime.
        if 'DateTimeOriginal' in info:
            r['ftime'] = info['DateTimeOriginal']
        elif 'DateTime' in info:
            r['ftime'] = info['DateTime']
        if 'ftime' in r:
            r['ftime'] = datetime.strptime(r['ftime'],'%Y:%m:%d %H:%M:%S')
        elif ARGS.filetime:
            # No EXIF date: fall back to the file's modification time.
            mtime = getmtime( joinpath( r['path'], r['name'] + r['ext'] ))
            r['ftime'] = datetime.fromtimestamp( mtime )
    # Remove any files without datetime info.
    DATA = [ f for f in DATA if 'ftime' in f ]
    # Generate new path YYYY/MM/DD/ using EXIF date.
    for r in DATA:
        r['newpath'] = joinpath( ARGS.dest, r['ftime'].strftime('%Y/%m/%d') )
    # Generate filenames per directory: 1 to n (zero padded) with DDMMMYYYY.
    print 'Generating filenames.'
    for newdir in set( [ i['newpath'] for i in DATA ] ):
        files = [ r for r in DATA if r['newpath'] == newdir ]
        pad = len( str( len(files) ) )
        usednames = []
        for i in range( len(files) ):
            datestr = files[i]['ftime'].strftime('%d%b%Y')
            newname = '%0*d_%s' % (pad, i+1, datestr)
            j = i+1
            # If filename exists keep looking until it doesn't. Ugly!
            while ( exists( joinpath( newdir, newname + files[i]['ext'] ) ) or
                    newname in usednames ):
                j += 1
                jpad = max( pad, len( str( j ) ) )
                newname = '%0*d_%s' % (jpad, j, datestr)
            usednames.append( newname )
            files[i]['newname'] = newname
    # Copy (or move) the files to their new locations, creating directories
    # as required.
    print 'Copying files.'
    for r in DATA:
        origfile = joinpath( r['path'], r['name'] + r['ext'] )
        newfile = joinpath( r['newpath'], r['newname'] + r['ext'] )
        if not exists( r['newpath'] ):
            makedirs( r['newpath'] )
        if not exists( newfile ):
            if ARGS.copy:
                print 'Copying '+ origfile +' to '+ newfile
                copy( origfile, newfile )
            else:
                print 'Moving '+ origfile +' to '+ newfile
                move( origfile, newfile )
        else:
            print newfile +' already exists!'
    # Source directories can only have been emptied if files were moved.
    if not ARGS.copy:
        print 'Removing empty directories'
        DIRS = set( [ d['path'] for d in DATA ] )
        for d in DIRS:
            # If the directory is empty then delete it.
            if len( listdir( d ) ) == 0:
                print 'Deleting dir ' + d
                rmdir( d )
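The filename-numbering step is the fiddliest part of the script, so here it is pulled out on its own. This is a simplified Python 3 sketch, not the script's exact loop: numbered_names is a hypothetical helper, it checks collisions against a set of names passed in (standing in for the on-disk exists() check), and it keeps counting upwards after a collision instead of restarting from the file's index.

```python
def numbered_names(date_label, count, taken=()):
    """Return `count` names such as '01_04Jul2011', skipping names in `taken`.

    Hypothetical helper for illustration; `taken` stands in for files that
    already exist in the destination directory.
    """
    pad = len(str(count))          # zero-pad to the width of the largest index
    used = set(taken)
    names = []
    number = 0
    for _ in range(count):
        number += 1
        name = '%0*d_%s' % (pad, number, date_label)
        # On a collision keep counting until a free name is found.
        while name in used:
            number += 1
            width = max(pad, len(str(number)))
            name = '%0*d_%s' % (width, number, date_label)
        used.add(name)
        names.append(name)
    return names


print(numbered_names('04Jul2011', 3, taken={'2_04Jul2011'}))
```

The `%0*d` format takes the pad width as an argument, which is what lets the width grow with the number of files in a directory.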
UPDATE: I tend to run my duplicate file script over image collections before I organise them to remove any duplicates. You can find that script on GitHub at github.com/dpbrown/Duplicate-Files.
Downloading Wallpaper Images from Reddit with Python
In my previous post I demonstrated how to query Reddit using Python and JSON. My goal was a script to download the latest and greatest wallpapers from image subreddits like wallpaper, to keep my desktop wallpaper fresh and interesting. The main function of the script is to download any JPEG image listed in the specified subreddit to a folder.
A lot of the script turned out to be managing URLs, handling exceptions and checking image types so that links to the most commonly encountered image host, imgur, worked. I opted to use the reddit hash id of each post as the filename for the downloaded JPEG, as this seems to be a unique value, which means there are no collisions and it’s easy to programmatically check whether that item’s image has already been downloaded. The downside is that a hash value, unlike the item’s text title, doesn’t make for the most memorable filename.
The single most frustrating thing I encountered when writing this script is that I have yet to discover a programmatic way to work out the URL for an image on Flickr given a Flickr page URL. This is a real shame, as Flickr is a really popular image hosting site with a lot of great images.
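The URL handling and type checking described above can be sketched as two small functions. This is a Python 3 illustration under stated assumptions, not the script itself: normalise_imgur_url and is_jpeg are hypothetical names, and the rule that adding .jpg to an imgur link yields the image is simply what the script relies on, not something verified here.

```python
def normalise_imgur_url(url):
    """Return a URL expected to point at a JPEG for imgur links.

    Hypothetical helper; non-imgur URLs are returned unchanged.
    """
    if 'imgur.com' not in url:
        return url
    if url.endswith('.png'):
        # Swap the extension so that a JPEG is requested instead.
        return url[:-len('.png')] + '.jpg'
    if url.endswith(('.jpg', '.jpeg')):
        return url
    # Page-style link with no extension: ask for the JPEG directly.
    return url + '.jpg'


def is_jpeg(content_type, url):
    """Prefer the Content-Type header, fall back to the URL's extension."""
    if content_type:
        # Ignore any parameters after ';' in the header value.
        return content_type.split(';')[0].strip().lower() == 'image/jpeg'
    return url.lower().endswith(('.jpg', '.jpeg'))


print(normalise_imgur_url('http://imgur.com/abc12'))
print(is_jpeg('text/html', 'http://example.com/page'))
```

Checking the header first matters because a URL ending in .jpg can still return an HTML page.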
An example of running the script to download images with a score greater than 50 from the wallpaper sub-reddit into a folder called wallpaper would be as follows:
python redditdownload.py wallpaper wallpaper -s 50
And to run the same query but only get any new images you don’t already have, run the following:
python redditdownload.py wallpaper wallpaper -s 50 -update
You can find the source code for this post (and the previous) on GitHub at github.com/dpbrown/RedditImageGrab and the current source for the script is as follows:
"""Download images from a reddit.com subreddit."""
from urllib2 import urlopen, HTTPError, URLError
from httplib import InvalidURL
from argparse import ArgumentParser
from os.path import exists as pathexists, join as pathjoin
from os import mkdir
from reddit import getitems
if __name__ == "__main__":
    PARSER = ArgumentParser( description='Downloads files with specified extension from the specified subreddit.')
    PARSER.add_argument( 'reddit', metavar='r', help='Subreddit name.')
    PARSER.add_argument( 'dir', metavar='d', help='Dir to put downloaded files in.')
    PARSER.add_argument( '-last', metavar='l', default='', required=False, help='ID of the last downloaded file.')
    PARSER.add_argument( '-score', metavar='s', default=0, type=int, required=False, help='Minimum score of images to download.')
    PARSER.add_argument( '-num', metavar='n', default=0, type=int, required=False, help='Number of images to process.')
    PARSER.add_argument( '-update', default=False, action='store_true', required=False, help='Run until you encounter a file already downloaded.')
    ARGS = PARSER.parse_args()
    print 'Downloading images from "%s" subreddit' % (ARGS.reddit)
    ITEMS = getitems( ARGS.reddit, ARGS.last )
    # Counters: N attempted, D downloaded, E already exists, S skipped, F failed.
    N = D = E = S = F = 0
    FINISHED = False
    # Create the specified directory if it doesn't already exist.
    if not pathexists( ARGS.dir ):
        mkdir( ARGS.dir )
    while len(ITEMS) > 0 and not FINISHED:
        LAST = ''
        for ITEM in ITEMS:
            if ITEM['score'] < ARGS.score:
                print '\tSCORE: %s has score of %s which is lower than required score of %s.' % (ITEM['id'],ITEM['score'],ARGS.score)
                S += 1
            else:
                FILENAME = pathjoin( ARGS.dir, '%s.jpg' % (ITEM['id'] ) )
                # Don't download files multiple times!
                if not pathexists( FILENAME ):
                    try:
                        if 'imgur.com' in ITEM['url']:
                            # Change .png to .jpg for imgur urls.
                            if ITEM['url'].endswith('.png'):
                                ITEM['url'] = ITEM['url'].replace('.png','.jpg')
                            # Add .jpg to imgur urls that have no JPEG extension
                            # (a single test, so '.jpg' is never appended twice).
                            elif '.jpg' not in ITEM['url'] and '.jpeg' not in ITEM['url']:
                                ITEM['url'] = '%s.jpg' % ITEM['url']
                        RESPONSE = urlopen( ITEM['url'] )
                        INFO = RESPONSE.info()
                        # Work out file type either from the response or the url.
                        if 'content-type' in INFO.keys():
                            FILETYPE = INFO['content-type']
                        elif ITEM['url'].endswith( ('jpg', 'jpeg') ):
                            FILETYPE = 'image/jpeg'
                        else:
                            FILETYPE = 'unknown'
                        # Only try to download jpeg images.
                        if FILETYPE == 'image/jpeg':
                            FILEDATA = RESPONSE.read()
                            FILE = open( FILENAME, 'wb')
                            FILE.write(FILEDATA)
                            FILE.close()
                            print '\tDownloaded %s to %s.' % (ITEM['url'],FILENAME)
                            D += 1
                        else:
                            print '\tWRONG FILE TYPE: %s has type: %s!' % (ITEM['url'],FILETYPE)
                            S += 1
                    except HTTPError as ERROR:
                        print '\tHTTP ERROR: Code %s for %s.' % (ERROR.code,ITEM['url'])
                        F += 1
                    except URLError as ERROR:
                        print '\tURL ERROR: %s!' % ITEM['url']
                        F += 1
                    except InvalidURL as ERROR:
                        print '\tInvalid URL: %s!' % ITEM['url']
                        F += 1
                else:
                    print '\tALREADY EXISTS: %s for %s already exists.' % (FILENAME,ITEM['url'])
                    E += 1
                    if ARGS.update:
                        print '\tUpdate complete, exiting.'
                        FINISHED = True
                        break
            LAST = ITEM['id']
            N += 1
            if ARGS.num > 0 and N >= ARGS.num:
                print '\t%d images attempted, exiting.' % N
                FINISHED = True
                break
        ITEMS = getitems( ARGS.reddit, LAST )
    print 'Downloaded %d of %d (Skipped %d, Exists %d, Failed %d)' % (D, N, S, E, F)
Querying Reddit with Python
I’ve long been a fan of reddit, a social news site where users can submit news and comment and vote on other users’ submissions. Reddit provides a form of content filtering through subreddits, which are specialized by topic, e.g. the Python programming language.
I thought it would be fun to figure out how to get the most recent items for a particular subreddit and the previous (older) items for a given item in a subreddit. Both of these turned out to be really simple using existing Python packages to query reddit and process the JSON formatted response.
"""Return list of items from a sub-reddit of reddit.com."""
from urllib2 import urlopen, HTTPError
from json import JSONDecoder
def getitems( subreddit, previd='' ):
    """Return list of items from a subreddit."""
    url = 'http://www.reddit.com/r/%s.json' % subreddit
    # Get items after item with 'id' of previd.
    if previd != '':
        url = '%s?after=t3_%s' % (url, previd)
    try:
        json = urlopen( url ).read()
        data = JSONDecoder().decode( json )
        items = [ x['data'] for x in data['data']['children'] ]
    except HTTPError as ERROR:
        print '\tHTTP ERROR: Code %s for %s.' % (ERROR.code, url)
        items = []
    return items
if __name__ == "__main__":
    print 'Recent items for Python.'
    ITEMS = getitems( 'python' )
    for ITEM in ITEMS:
        print '\t%s - %s' % (ITEM['title'], ITEM['url'])
    # Only page backwards if the first request returned something.
    if ITEMS:
        print 'Previous items for Python.'
        OLDITEMS = getitems( 'python', ITEMS[-1]['id'] )
        for ITEM in OLDITEMS:
            print '\t%s - %s' % (ITEM['title'], ITEM['url'])
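The two steps the script combines, building the listing URL and unpacking the JSON response, can also be tried without touching the network. The following is a minimal Python 3 sketch under stated assumptions: listing_url and extract_items are hypothetical names, and SAMPLE is invented data containing only the fields the script above reads.

```python
import json


def listing_url(subreddit, previd=''):
    """Build the JSON listing URL, optionally paging past item `previd`."""
    url = 'http://www.reddit.com/r/%s.json' % subreddit
    if previd:
        # 't3_' is the prefix the script above puts in front of an item id.
        url = '%s?after=t3_%s' % (url, previd)
    return url


def extract_items(response_text):
    """Return the 'data' dictionary of every child in a listing response."""
    data = json.loads(response_text)
    return [child['data'] for child in data['data']['children']]


# Invented sample response, shaped like the structure the script reads.
SAMPLE = ('{"data": {"children": [{"data": {"id": "abc12", '
          '"title": "Example post", "url": "http://example.com/"}}]}}')

items = extract_items(SAMPLE)
print(items[0]['title'])
print(listing_url('python', items[-1]['id']))
```

Keeping the URL building separate from the parsing makes each piece easy to check on its own before adding the actual HTTP request.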
In my next post I’ll detail what I used the getitems script for.
The Rands Test
Rands has posted his own ‘Rands Test’ in the style of Joel Spolsky’s famous ‘Joel Test’ for telling if your company is screwed or not. Rands’s test focuses on communication, while Joel’s original test focused on engineering:
“There is a higher order goal at the intersection of the two questions The Rands Test intends to answer: Where am I? and What the hell is going on? While understanding the answers to these questions will give you a good idea about the communication health of your company, the higher order goal is selfish.”








