
We have a fairly complex print workflow in which the controlling department adds comments and annotations to draft versions of generated PDF documents using Adobe Reader or Adobe Acrobat. As part of the workflow, the annotated PDF documents should be parsed on import, and the annotations should be imported into a CMS (together with the PDF).

Q: Are there any reliable tools (preferably Python or Java) for extracting such data from PDF files in a clean way?

1 Answer


This code should do the job. One of the answers to the question "Parse annotations from a pdf" was very helpful in writing the code below. It uses the poppler library to parse the annotations. This is a link to annotations.pdf.

code

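import os.path

import poppler  # bindings from the Ubuntu package python-poppler

# poppler expects a file:// URI rather than a plain path
path = 'file://%s' % os.path.realpath('annotations.pdf')
doc = poppler.document_new_from_file(path, None)  # None: no password
pages = [doc.get_page(i) for i in range(doc.get_n_pages())]

for page_no, page in enumerate(pages):
    # collect the text of every annotation on the page, dropping empty ones
    items = [i.annot.get_contents() for i in page.get_annot_mapping()]
    items = [i for i in items if i]
    print("page: %s comments: %s " % (page_no + 1, items))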

output

page: 1 comments: ['This is an annotation'] 
page: 2 comments: [' Please note ', ' Please note ', 'This is a comment in the text'] 
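
Since the goal in the question is to hand the comments over to a CMS, the per-page lists could be flattened into simple records first. This is only a sketch: `to_records` and the `page`/`text` field names are hypothetical, and the sample data simply mirrors the output shown here.

```python
import json


def to_records(comments_by_page):
    """Flatten per-page comment lists into one dict per non-empty comment.

    `comments_by_page` is a list of lists of strings, one inner list per
    page, in page order (a hypothetical input shape).
    """
    records = []
    for page_no, comments in enumerate(comments_by_page, start=1):
        for text in comments:
            cleaned = (text or "").strip()
            if cleaned:  # skip annotations without any text
                records.append({"page": page_no, "text": cleaned})
    return records


# Sample data copied from the output of the poppler script.
sample = [
    ["This is an annotation"],
    [" Please note ", " Please note ", "This is a comment in the text"],
]
print(json.dumps(to_records(sample), indent=2))
```

Stripping the surrounding whitespace is a design choice made for the sketch; if the CMS needs the text exactly as entered in Acrobat, the `strip()` call should be dropped.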

installation

On Ubuntu, the installation is as follows.

apt-get install python-poppler
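
On systems where this package is not available, a pure-Python PDF library is one possible alternative. The sketch below assumes the third-party `pypdf` package, which is not what this answer uses. It relies on annotations being listed under each page's `/Annots` key with their text in `/Contents`. The function names are hypothetical, and the code has not been run against annotations.pdf.

```python
def annotation_texts(annots):
    """Return the non-empty /Contents strings from annotation dictionaries."""
    texts = []
    for annot in annots:
        contents = annot["/Contents"] if "/Contents" in annot else None
        if contents:
            texts.append(str(contents))
    return texts


def comments_by_page(path):
    """Map 1-based page numbers to annotation texts (assumes pypdf)."""
    from pypdf import PdfReader  # third-party; imported lazily

    result = {}
    reader = PdfReader(path)
    for page_no, page in enumerate(reader.pages, start=1):
        annots = []
        if "/Annots" in page:  # the key is optional
            # entries are indirect references that must be resolved
            annots = [ref.get_object() for ref in page["/Annots"]]
        result[page_no] = annotation_texts(annots)
    return result
```

Calling `comments_by_page('annotations.pdf')` should then give a dict of page numbers to comment lists, comparable to the printed output of the poppler version.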
Marwan Alsabbagh
  • This is exactly what I need but I'm having a huge amount of trouble installing poppler. Any assistance would be greatly appreciated - I've just put a question on it [here](http://stackoverflow.com/questions/32176950/install-poppler-for-python-on-mac) – simmons Aug 24 '15 at 08:00
  • @simmons I've put the installation instructions for Ubuntu. I wasn't able to install it via pip – Marwan Alsabbagh Aug 24 '15 at 09:40
  • You need `libpoppler-cpp-dev` on Ubuntu before you run `pip install python-poppler`. – Martin Thoma Sep 01 '20 at 10:25
  • After installing python-poppler, I get `AttributeError: module 'poppler' has no attribute 'document_new_from_file'` – Martin Thoma Sep 01 '20 at 10:30
  • I get `E: Unable to locate package python-poppler` when I run the apt-get command – Homero Esmeraldo May 19 '22 at 22:46