Counting words and images in RSS posts with Python

I have often wondered how many words I should aim for per blog post, or how many images I should include. This led me to reach for Python and whip up the following script, which grabs the posts from a site's RSS feed, counts the image tags, strips the HTML and then counts the words left.

This script lets me see the minimum, maximum and average number of images and words for some of my favourite blogs, which have an average word count of about 330 words and a dozen images. This is reassuring, as I was never sure how many words justified a post, and it's a clear indication that I should consider using images much more.

### Script to fetch an RSS feed and work out the min, max, average word and 
### image counts of posts in the feed.

import xml.etree.ElementTree 
import urllib2
from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    ### From http://stackoverflow.com/questions/753052/
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    ### From http://stackoverflow.com/questions/753052/
    s = MLStripper()
    s.feed(html)
    return s.get_data()

def getPostStats( url ):
    ### Fetch specified feed and count words and images per post.

    # Download the feed.
    raw = urllib2.urlopen(url)

    # Parse the feed xml.
    parsed = xml.etree.ElementTree.parse(raw)
    root = parsed.getroot()
    channel = root.find('channel')

    titles = []
    images = []
    words = []

    # Find the articles (items)
    for item in channel.findall('item'):
        text = item.findtext('description','')

        # Many feeds put the full post body in content:encoded
        # (the RSS content module) rather than description.
        namespaces = {'content': 'http://purl.org/rss/1.0/modules/content/'}
        content = item.findtext('content:encoded','', namespaces=namespaces)

        if len(text) < len(content):
            text = content

        # Count the number of images
        images.append(text.count('<img'))
 
        # Count the number of words
        text = strip_tags(text)
        words.append(len(text.split()))
        
        # Get the post title
        titles.append(item.findtext('title',''))

    return titles,words,images

def getMinMaxAvg( counts ):
    ### return the min, max & average counts.

    minCount = min(counts)
    maxCount = max(counts)
    avgCount = sum(counts,0) / len(counts)

    return minCount,maxCount,avgCount 

if __name__ == "__main__":
    # List of interesting blogs.
    TESTURLS = [ {"name":"Lady Slider", "url":"http://www.ladyslider.com/blog?format=RSS"},
                 {"name":"Shoot Tokyo", "url":"http://shoottokyo.com/feed/"}, 
                 {"name":"DeadPxl", "url":"http://dedpxl.com/feed/"}, 
                 {"name":"circa 1983", "url":"http://blog.circa1983.ca/rss"}, 
                 {"name":"David Duchemin", "url":"http://davidduchemin.com/feed/"} ]

    # Go find the image and word counts for each blog!
    for TEST in TESTURLS:
        TITLES,WORDS,IMAGES = getPostStats( TEST["url"] )

        print "--- %s ---" % TEST["name"]

        for n in range(0,len(TITLES)):
            print "  '%s' - %d words, %d images." % (TITLES[n],WORDS[n],IMAGES[n])

        print "Posts - %d." % len(TITLES)
        
        MIN,MAX,AVG = getMinMaxAvg(IMAGES)
        print "Images  - Min: %d Max: %d Avg: %d." % (MIN,MAX,AVG)
        
        MIN,MAX,AVG = getMinMaxAvg(WORDS)
        print "Words   - Min: %d Max: %d Avg: %d." % (MIN,MAX,AVG)

You should get output like the following for each RSS feed:

--- David Duchemin ---
'PHOTOGRAPH, Issue 10' - 203 words, 16 images.
'Make It Now.' - 524 words, 1 images.
'A World of Stories' - 378 words, 3 images.
'Light , Gesture, & Color' - 449 words, 4 images.
'Cape Churchill Polar Bears' - 714 words, 7 images.
'Hudson Bay Polar Bears' - 330 words, 1 images.
'Study the Masters: Margaret Bourke-White' - 655 words, 4 images.
'About Critique' - 693 words, 1 images.
'Inspired by the Tangible' - 518 words, 1 images.
'The Created Image, Vol.02' - 390 words, 3 images.
Posts - 10.
Images - Min: 1 Max: 16 Avg: 4.
Words - Min: 203 Max: 714 Avg: 485.
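The tag-stripping approach is easy to sanity-check in isolation. Here is a minimal sketch of the same idea written for Python 3, where HTMLParser lives in html.parser (the script above targets Python 2):

```python
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

post = '<p>Hello <img src="cat.jpg"/> world</p>'
print(post.count('<img'))             # image count: 1
print(len(strip_tags(post).split()))  # word count: 2
```

This mirrors exactly what the script does per post: count `<img` occurrences before stripping, then split the stripped text on whitespace to count words.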

Listen, learn … then lead!

An interesting TED talk from four-star General Stanley McChrystal about how the events following 9/11 led to a new style of war, and with it a requirement for a very different form of leadership over a widely distributed military response.

I think this is a worthwhile talk for any leader to watch, to hear how the General adapted in the face of change.

Generating passwords with Python

Occasionally I find myself lacking inspiration for a password that I will not use frequently, that I want to be secure, and that I don't mind storing in a secure password manager. When this happens I use the very handy UUID module in the Python standard library to generate a semi-decent password.

"""Generate a string suitible for password usage using the UUID module."""

from uuid import uuid4

print str(uuid4())

This will produce output like the following:
e1de6232-4a74-45cf-8bc3-1dd8a76af4de

The main drawback with this approach is that the generated passwords are not easily memorable by the average human being, so you need to store them somewhere safe and secure. If you lose or forget the password, you're stuffed!
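If the hyphens bother you, the UUID object exposes the same random bits as a 32-character hex string via its hex attribute:

```python
from uuid import uuid4

# A version-4 UUID rendered as 32 lowercase hex characters, no hyphens.
password = uuid4().hex
print(password)
```

Either form carries the same randomness; the hex form is just slightly more compact and avoids any password fields that reject punctuation.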

Converting Lightroom GPS coordinates for Google Maps

I have wanted to add a map of the locations of the photographs on my photo blog SeeStockholm.se for a while now. I have the coordinates in Lightroom for the images in the degrees, minutes, seconds (DMS) format, e.g. 59°16’31″ N 18°19’8″ E. However, Google Maps uses the decimal degrees (DD) format, e.g. 59.2753 N 18.3189 E.

I needed a way to convert the coordinates from one representation to the other. After a bit of googling and some experimentation I wrote the following JavaScript functions to convert from DMS format to DD format and create a Google Maps google.maps.LatLng object.

    function ConvertDMSToDD(degrees, minutes, seconds, direction) 
    {
        var dd = parseFloat(degrees) + parseFloat(minutes)/60 + parseFloat(seconds)/(60*60);
        if (direction == "S" || direction == "W") {
            dd = dd * -1;
        } // Don't do anything for N or E
        return dd;
    }

    function ParseDMS(input) 
    {
        var parts = input.split(/[^\d\w]+/);
        var lat = ConvertDMSToDD(parts[0], parts[1], parts[2], parts[3]);
        var lng = ConvertDMSToDD(parts[4], parts[5], parts[6], parts[7]);
        return new google.maps.LatLng( lat, lng );
    }

This makes the conversion process simply a case of calling ParseDMS with a DMS-format coordinate string; it returns a LatLng object ready for use in Google Maps. These conversion functions allowed me to easily implement the map feature for my photo blog.
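For anyone preparing the map data server-side instead, the same conversion is a straightforward port to Python. This is a sketch of my own rather than code from the original post; it returns a plain (lat, lng) tuple, since google.maps.LatLng only exists in the browser:

```python
import re

def convert_dms_to_dd(degrees, minutes, seconds, direction):
    """Convert one degrees/minutes/seconds component to decimal degrees."""
    dd = float(degrees) + float(minutes) / 60 + float(seconds) / 3600
    # South and West become negative in decimal-degrees notation.
    return -dd if direction in ('S', 'W') else dd

def parse_dms(text):
    """Split 'D M S dir D M S dir' on non-word characters, as the JS does."""
    parts = re.split(r'\W+', text)
    lat = convert_dms_to_dd(parts[0], parts[1], parts[2], parts[3])
    lng = convert_dms_to_dd(parts[4], parts[5], parts[6], parts[7])
    return lat, lng

print(parse_dms("59°16'31\" N 18°19'8\" E"))  # roughly (59.2753, 18.3189)
```

The `\W+` split handles the degree, minute and second symbols without needing to match them explicitly, just like the regex in the JavaScript version.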

Exceeding the forty hour work week

To follow on from the ‘How to make work-life balance work’ video, Alison Morris from Online MBA has a pretty interesting infographic on the effect of the current trend in America of working more than forty hours a week: it is pretty sobering stuff!

While Europe tends to be better at work-life balance than North America, there is still room for improvement on both sides of the Atlantic. I believe it is in an employer's best interests not to overwork their staff if they want to get the best quality of work.

Scraping PDF with Python

There are several PDF modules available for Python. So far I've found Slate to be the simplest to use, and PDFMiner to be potentially the most powerful but also the most complicated. For the problem I needed to solve, extracting text with whitespace characters intact, I found the following fragment of PDFMiner code on Stack Overflow to be the only solution:

"""Extract text from PDF file using PDFMiner with whitespace inatact."""

from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from cStringIO import StringIO

def scrape_pdf(path):
    """From http://stackoverflow.com/a/8325135/39040."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text

If you don’t need whitespace to be left intact I’d strongly recommend Slate over PDFMiner, as it's significantly easier to work with, although it does offer a smaller feature set.