Extracting PDF annotations/comments

Question

We have a fairly complex print workflow in which the controlling department adds comments and annotations to draft versions of generated PDF documents using Adobe Reader or Adobe Acrobat. As part of the workflow, the annotated PDF documents should be parsed and the annotations imported into a CMS (together with the PDF).

Q: Are there any reliable tools (preferably Python or Java) for extracting such data from PDF files in a clean and reliable way?

Can you post a link to a sample PDF that contains an annotation and a comment, so we can work on it? – Marwan Alsabbagh, Dec 06 '12 at 16:45

score 4 · Accepted Answer · edited May 23 '17 at 12:25

This code should do the job. One of the answers to the question "Parse annotations from a pdf" was very helpful in writing it. It uses the poppler library to parse the annotations. The sample file is annotations.pdf.

code

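# Uses the legacy poppler bindings (Ubuntu package python-poppler)
import os.path

import poppler

# poppler expects a file:// URI rather than a plain path
path = 'file://%s' % os.path.realpath('annotations.pdf')
doc = poppler.document_new_from_file(path, None)  # None: no password
pages = [doc.get_page(i) for i in range(doc.get_n_pages())]

for page_no, page in enumerate(pages):
    # Collect the text of every annotation on the page, skipping empty ones
    items = [i.annot.get_contents() for i in page.get_annot_mapping()]
    items = [i for i in items if i]
    print("page: %s comments: %s " % (page_no + 1, items))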

output

page: 1 comments: ['This is an annotation'] 
page: 2 comments: [' Please note ', ' Please note ', 'This is a comment in the text']
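
For the CMS import mentioned in the question, the per-page lists printed above could be flattened into one record per comment. The following is only a sketch in Python 3: the function name `comments_to_records` and the record fields (`pdf`, `page`, `comment`) are hypothetical and would need to match whatever the CMS actually expects.

```python
import json


def comments_to_records(comments_by_page, source_pdf):
    """Flatten {page_number: [comment, ...]} into one dict per comment."""
    records = []
    for page_no in sorted(comments_by_page):
        for text in comments_by_page[page_no]:
            cleaned = text.strip()
            if cleaned:  # skip annotations that carry no text
                records.append(
                    {"pdf": source_pdf, "page": page_no, "comment": cleaned}
                )
    return records


# Sample data copied from the output shown in the answer.
sample = {
    1: ["This is an annotation"],
    2: [" Please note ", " Please note ", "This is a comment in the text"],
}
records = comments_to_records(sample, "annotations.pdf")
print(json.dumps(records, indent=2))
```

Duplicates such as the two "Please note" entries are kept on purpose, since they may belong to separate annotations on the page.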

installation

On Ubuntu the installation is as follows.

apt-get install python-poppler
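
If the python-poppler package is not available on your system, a similar result can probably be obtained with a different library. The sketch below swaps in pypdf (Python 3, `pip install pypdf`), which is not what the answer uses. The helper names `collect_comments` and `comments_from_file` are made up for illustration, and it assumes each annotation stores its text under the standard `/Contents` key.

```python
def collect_comments(pages):
    """Return {page_number: [comment text, ...]} for pages that have comments.

    `pages` is any sequence of dict-like page objects,
    for example the `pages` attribute of a pypdf PdfReader.
    """
    result = {}
    for page_no, page in enumerate(pages, start=1):
        if "/Annots" not in page:
            continue  # this page has no annotations at all
        texts = []
        for annot in page["/Annots"]:
            obj = annot.get_object()  # resolve the indirect reference
            if "/Contents" in obj and str(obj["/Contents"]).strip():
                texts.append(str(obj["/Contents"]))
        if texts:
            result[page_no] = texts
    return result


def comments_from_file(path):
    """Open a PDF with pypdf (assumed to be installed) and collect comments."""
    # Imported lazily so collect_comments itself has no third-party dependency.
    from pypdf import PdfReader

    return collect_comments(PdfReader(path).pages)
```

Calling `comments_from_file('annotations.pdf')` should then give a dictionary keyed by page number. Annotations without text, such as plain links, are skipped.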

answered Dec 06 '12 at 17:16

Marwan Alsabbagh

This is exactly what I need but I'm having a huge amount of trouble installing poppler. Any assistance would be greatly appreciated - I've just put a question on it [here](http://stackoverflow.com/questions/32176950/install-poppler-for-python-on-mac) – simmons Aug 24 '15 at 08:00

@simmons I've put the installation instructions for Ubuntu. I wasn't able to install it via pip – Marwan Alsabbagh Aug 24 '15 at 09:40

You need `libpoppler-cpp-dev` on Ubuntu before you run `pip install python-poppler`. – Martin Thoma Sep 01 '20 at 10:25

After installing python-poppler, I get `AttributeError: module 'poppler' has no attribute 'document_new_from_file'` – Martin Thoma Sep 01 '20 at 10:30

I get `E: Unable to locate package python-poppler` when I run the apt-get command – Homero Esmeraldo May 19 '22 at 22:46
