Scraping PDF with Python
There are several PDF modules available for Python. So far I’ve found Slate to be the simplest to use, and PDFMiner to be potentially the most powerful but also the most complicated. For the problem I needed to solve, extracting text with whitespace characters intact, the following fragment of PDFMiner code from StackOverflow was the only solution I found:
"""Extract text from PDF file using PDFMiner with whitespace inatact.""" from pdfminer.pdfparser import PDFDocument, PDFParser from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter, process_pdf from pdfminer.pdfdevice import PDFDevice, TagExtractor from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter from pdfminer.cmapdb import CMapDB from pdfminer.layout import LAParams from cStringIO import StringIO def scrap_pdf(path): """From http://stackoverflow.com/a/8325135/39040.""" rsrcmgr = PDFResourceManager() retstr = StringIO() codec = 'utf-8' laparams = LAParams() device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams) fp = file(path, 'rb') process_pdf(rsrcmgr, device, fp) fp.close() device.close() str = retstr.getvalue() retstr.close() return str
If you don’t need whitespace to be left intact, I’d strongly recommend Slate over PDFMiner, as it’s significantly easier to work with, although Slate does offer a smaller feature set.
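The PDFMiner fragment above relies on Python 2 (cStringIO, file()) and on an older PDFMiner API. As a rough sketch for Python 3, assuming the pdfminer.six fork is installed, its high-level extract_text function should cover the same ground. The function name scrape_pdf_text is made up for this example, and the output is not guaranteed to match the whitespace produced by the fragment above.

```python
import os


def scrape_pdf_text(path):
    """Return the text of the PDF at `path` as a single string.

    Hypothetical helper for illustration; assumes pdfminer.six is installed.
    """
    # Fail early with a clear error if the file is missing.
    if not os.path.isfile(path):
        raise FileNotFoundError(path)

    # Imported here so that merely defining the function does not
    # require pdfminer.six to be present.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # Pass LAParams() explicitly, mirroring laparams=LAParams()
    # in the PDFMiner fragment above.
    return extract_text(path, laparams=LAParams())
```

Calling scrape_pdf_text('report.pdf') returns the extracted text as a str; 'report.pdf' is only a placeholder path.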