Scraping PDF with Python
There are several PDF modules available for Python. So far I’ve found Slate to be the simplest to use, and PDFMiner to be potentially the most powerful but also the most complicated. For the problem I needed to solve, extracting text with whitespace characters intact, the following fragment of PDFMiner code from StackOverflow was the only solution I found:
"""Extract text from a PDF file using PDFMiner with whitespace intact."""
# Written for Python 2 (cStringIO) and the older PDFMiner API that provides process_pdf.
from cStringIO import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf


def scrap_pdf(path):
    """From http://stackoverflow.com/a/8325135/39040."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    # Passing an LAParams object enables PDFMiner's layout analysis.
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    # Run every page through the converter, which writes the text into retstr.
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()
    # Named `text` so the built-in `str` is not shadowed.
    text = retstr.getvalue()
    retstr.close()
    return text
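The fragment above relies on Python 2 and on `process_pdf`, which, as far as I can tell, newer PDFMiner releases no longer ship. A rough Python 3 sketch of the same idea follows, assuming the pdfminer.six fork is installed and using its `extract_text` helper. The names `extract_text_keep_whitespace` and `split_pages` are made up for this illustration, and the page split assumes the text converter ends each page with a form feed character.

```python
def extract_text_keep_whitespace(path):
    """Return the text of the PDF at `path` with layout analysis enabled."""
    # Assumption: pdfminer.six is installed. Imported inside the function so
    # that split_pages() below can be used without the library.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(path, laparams=LAParams())


def split_pages(text):
    """Split extracted text into pages, dropping pages that are blank.

    Assumption: each page ends with a form feed ('\\f') in the output.
    """
    return [page for page in text.split("\f") if page.strip()]


if __name__ == "__main__":
    # Hypothetical sample standing in for real extractor output.
    sample = "Page one\nline two\n\fPage two\n\f"
    for number, page in enumerate(split_pages(sample), start=1):
        print(number, repr(page))
```

Because the line breaks inside each page are left untouched, `page.splitlines()` can then be used to walk a page line by line.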
If you don’t need whitespace to be left intact, I’d strongly recommend Slate over PDFMiner, as it’s significantly easier to work with, although it does offer a smaller feature set.

Thanks for the brief but extremely useful post.
I am using this example to help me scrape legal docs.
James Osburn
Glad to hear it was useful to you, James!