Parse annotations from a pdf

Question

I want a python function that takes a pdf and returns a list of the text of the note annotations in the document. I have looked at python-poppler (https://code.launchpad.net/~poppler-python/poppler-python/trunk) but I can not figure out how to get it to give me anything useful.

I found the get_annot_mapping method and modified the demo program provided to call it via self.current_page.get_annot_mapping(), but I have no idea what to do with an AnnotMapping object. It seems to not be fully implemented, providing only the copy method.

If there are any other libraries that provide this function, that's fine as well.

score 25 · Answer 1 · edited May 08 '22 at 10:50

You should DEFINITELY have a look at PyPDF2. This library has a lot of potential: you can extract almost anything from a PDF, including images and comments. Start by examining what Acrobat Reader DC (Reader) can give you on a PDF’s comments. Take a simple PDF, annotate it (add some comments) with Reader, then in the comments tab in the upper right corner click the horizontal three dots, choose Export All To Data File... and select the format with the extension xfdf. This creates an XML file which you can parse. The format is transparent and largely self-explanatory.

If, however, you cannot rely on a user clicking this and instead need to extract the same data from a PDF programmatically using python, do not despair, there is a solution. (Inspired by Extract images from PDF without resampling, in python?)

Prerequisites

pip install PyPDF2

xfdf XML

The xfdf file that Reader exports looks like this:

<?xml version="1.0" ?>
<xfdf xml:space="preserve" xmlns="http://ns.adobe.com/xfdf/">
    <annots>
        <caret IT="Replace" color="#0000FF" creationdate="D:20190221151519+01'00'" date="D:20190221151526+01'00'" flags="print" fringe="1.069520,1.069520,1.069520,1.069520" name="72f8d1b7-d878-4281-bd33-3a6fb4578673" page="0" rect="636.942000,476.891000,652.693000,489.725000" subject="Inserted Text" title="Admin">
            <contents-richtext>
                <body xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
                    <p dir="ltr">
                        <span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal"> comment1</span>
                    </p>
                </body>
            </contents-richtext>
            <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,374.656000,941.008000,488.656000"/>
        </caret>
        <highlight color="#FFD100" coords="183.867000,402.332000,220.968000,402.332000,183.867000,387.587000,220.968000,387.587000" creationdate="D:20190221151441+01'00'" date="D:20190221151448+01'00'" flags="print" name="a18c7fb0-0af3-435e-8c32-1af2af3c46ea" opacity="0.399994" page="0" rect="179.930000,387.126000,224.904000,402.793000" subject="Highlight" title="Admin">
            <contents-richtext>
                <body xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">
                    <p dir="ltr">
                        <span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment2</span>
                    </p>
                </body>
            </contents-richtext>
            <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,288.332000,941.008000,402.332000"/>
        </highlight>
        <caret color="#0000FF" creationdate="D:20190221151452+01'00'" date="D:20190221151452+01'00'" flags="print" fringe="0.828156,0.828156,0.828156,0.828156" name="6bf0226e-a3fb-49bf-bc89-05bb671e1627" page="0" rect="285.877000,372.978000,298.073000,382.916000" subject="Inserted Text" title="Admin">
            <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,268.088000,941.008000,382.088000"/>
        </caret>
        <strikeout IT="StrikeOutTextEdit" color="#0000FF" coords="588.088000,497.406000,644.818000,497.406000,588.088000,477.960000,644.818000,477.960000" creationdate="D:20190221151519+01'00'" date="D:20190221151519+01'00'" flags="print" inreplyto="72f8d1b7-d878-4281-bd33-3a6fb4578673" name="6686b852-3924-4252-af21-c1b10390841f" page="0" rect="582.290000,476.745000,650.616000,498.621000" replyType="group" subject="Cross-Out" title="Admin">
            <popup flags="print,nozoom,norotate" open="no" page="0" rect="737.008000,383.406000,941.008000,497.406000"/>
        </strikeout>
    </annots>
    <f href="p1.pdf"/>
    <ids modified="ABB10FA107DAAA47822FB5D311112349" original="474F087D87E7E544F6DEB9E0A93ADFB2"/>
</xfdf>

Various types of comments are presented here as tags within an <annots> block.
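As a rough illustration of how such a file could be parsed, here is a minimal sketch using only the standard library. The function name xfdf_comments is made up for this example, and it assumes the comment text always sits inside a contents-richtext element, as in the export above; other exports may differ.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the <xfdf> root element
XFDF_NS = "{http://ns.adobe.com/xfdf/}"


def xfdf_comments(xfdf_text):
    """Return (annotation type, page, text) for each annotation that carries text.

    Hypothetical helper: assumes the text lives in <contents-richtext>.
    """
    root = ET.fromstring(xfdf_text)
    comments = []
    annots = root.find(XFDF_NS + "annots")
    if annots is None:
        return comments
    for annot in annots:
        rich = annot.find(XFDF_NS + "contents-richtext")
        if rich is None:
            continue  # e.g. a bare caret or strikeout without comment text
        # itertext() walks the nested xhtml body and yields only the text nodes
        text = "".join(rich.itertext()).strip()
        kind = annot.tag.replace(XFDF_NS, "")
        comments.append((kind, int(annot.get("page")), text))
    return comments
```

For the file above this should yield one entry for the caret with " comment1" (stripped) and one for the highlight with "comment2"; the caret and strikeout without rich text are skipped. When reading from disk, ET.parse(path).getroot() is the safer entry point because it handles the encoding declaration.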

Using PyPDF2

Python can give you almost the same data. To obtain it, have a look at what the output of the following script gives:

from PyPDF2 import PdfFileReader

reader = PdfFileReader("/path/to/my/file.pdf")

for page in reader.pages:
    # pages without annotations simply have no /Annots key
    if "/Annots" not in page:
        continue
    for annot in page["/Annots"]:
        print(annot.getObject())       # (1)
        print("")

The output for the same file as in the xfdf file above will look like this:

{'/Popup': IndirectObject(192, 0), '/M': u"D:20190221151448+01'00'", '/CreationDate': u"D:20190221151441+01'00'", '/NM': u'a18c7fb0-0af3-435e-8c32-1af2af3c46ea', '/F': 4, '/C': [1, 0.81961, 0], '/Rect': [179.93, 387.126, 224.904, 402.793], '/Type': '/Annot', '/T': u'Admin', '/RC': u'<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" ><p dir="ltr"><span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment2</span></p></body>', '/P': IndirectObject(5, 0), '/Contents': u'otrasneho', '/QuadPoints': [183.867, 402.332, 220.968, 402.332, 183.867, 387.587, 220.968, 387.587], '/Subj': u'Highlight', '/CA': 0.39999, '/AP': {'/N': IndirectObject(202, 0)}, '/Subtype': '/Highlight'}

{'/Parent': IndirectObject(191, 0), '/Rect': [737.008, 288.332, 941.008, 402.332], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A425D0>, '/Subtype': '/Popup'}

{'/Popup': IndirectObject(194, 0), '/M': u"D:20190221151452+01'00'", '/CreationDate': u"D:20190221151452+01'00'", '/NM': u'6bf0226e-a3fb-49bf-bc89-05bb671e1627', '/F': 4, '/C': [0, 0, 1], '/Subj': u'Inserted Text', '/Rect': [285.877, 372.978, 298.073, 382.916], '/Type': '/Annot', '/P': IndirectObject(5, 0), '/AP': {'/N': IndirectObject(201, 0)}, '/RD': [0.82816, 0.82816, 0.82816, 0.82816], '/T': u'Admin', '/Subtype': '/Caret'}

{'/Parent': IndirectObject(193, 0), '/Rect': [737.008, 268.088, 941.008, 382.088], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42830>, '/Subtype': '/Popup'}

{'/Popup': IndirectObject(196, 0), '/M': u"D:20190221151519+01'00'", '/CreationDate': u"D:20190221151519+01'00'", '/NM': u'6686b852-3924-4252-af21-c1b10390841f', '/F': 4, '/IRT': IndirectObject(197, 0), '/C': [0, 0, 1], '/Rect': [582.29, 476.745, 650.616, 498.621], '/Type': '/Annot', '/T': u'Admin', '/P': IndirectObject(5, 0), '/QuadPoints': [588.088, 497.406, 644.818, 497.406, 588.088, 477.96, 644.818, 477.96], '/Subj': u'Cross-Out', '/IT': '/StrikeOutTextEdit', '/AP': {'/N': IndirectObject(200, 0)}, '/RT': '/Group', '/Subtype': '/StrikeOut'}

{'/Parent': IndirectObject(195, 0), '/Rect': [737.008, 383.406, 941.008, 497.406], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42AF0>, '/Subtype': '/Popup'}

{'/Popup': IndirectObject(198, 0), '/M': u"D:20190221151526+01'00'", '/CreationDate': u"D:20190221151519+01'00'", '/NM': u'72f8d1b7-d878-4281-bd33-3a6fb4578673', '/F': 4, '/C': [0, 0, 1], '/Rect': [636.942, 476.891, 652.693, 489.725], '/Type': '/Annot', '/RD': [1.06952, 1.06952, 1.06952, 1.06952], '/T': u'Admin', '/RC': u'<?xml version="1.0"?><body xmlns="http://www.w3.org/1999/xhtml" xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/" xfa:APIVersion="Acrobat:19.10.0" xfa:spec="2.0.2" ><p dir="ltr"><span dir="ltr" style="font-size:10.5pt;text-align:left;color:#000000;font-weight:normal;font-style:normal">comment1</span></p></body>', '/P': IndirectObject(5, 0), '/Contents': u' pica', '/Subj': u'Inserted Text', '/IT': '/Replace', '/AP': {'/N': IndirectObject(212, 0)}, '/Subtype': '/Caret'}

{'/Parent': IndirectObject(197, 0), '/Rect': [737.008, 374.656, 941.008, 488.656], '/Type': '/Annot', '/F': 28, '/Open': <PyPDF2.generic.BooleanObject object at 0x02A42AB0>, '/Subtype': '/Popup'}

If you examine the output, you will see that it carries more or less the same data as the xfdf file. Every comment in the xfdf file has two counterparts in PyPDF2’s output in python: the annotation itself and its popup (the objects with '/Subtype': '/Popup'). The /C attribute is the color of the annotation, in RGB, scaled to floats in the range [0, 1]. /Rect defines the bounding box of the comment on the page/spread, in points (1/72 of an inch) relative to the lower-left corner of the page, with values increasing to the right and up. /M and /CreationDate are the modification and creation times. /QuadPoints is an array of [x1, y1, x2, y2, ..., xn, yn] coordinates outlining the marked text. /Subj, /Type, /Subtype and /IT identify the type of the comment, /T is probably the creator, and /RC is an xhtml representation of the comment’s text if there is one. An ink-drawn comment will appear with an attribute /InkList holding data in the form [[L1x1, L1y1, L1x2, L1y2, ..., L1xn, L1yn], [L2x1, L2y1, ..., L2xn, L2yn], ..., [Lmx1, Lmy1, ..., Lmxn, Lmyn]] for line 1, line 2, ..., line m.
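To make those fields easier to work with, the raw values can be converted into friendlier Python types. The following is only a sketch; the three helper names are invented for this example, and the date parser assumes the full D:YYYYMMDDHHmmSS+HH'mm' form seen in the output above (the PDF format allows shorter variants that this does not handle).

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def color_to_hex(c):
    """Convert a /C array of floats in [0, 1] to an #RRGGBB string."""
    return "#" + "".join(f"{round(v * 255):02X}" for v in c)


def parse_pdf_date(value):
    """Parse a PDF date such as D:20190221151448+01'00' (full form only)."""
    cleaned = value[2:] if value.startswith("D:") else value
    # Drop the apostrophes so the offset reads +0100, which %z understands
    return datetime.strptime(cleaned.replace("'", ""), "%Y%m%d%H%M%S%z")


def rich_text_to_plain(rc):
    """Strip the xhtml markup from an /RC value, keeping only the text."""
    return "".join(ET.fromstring(rc).itertext()).strip()
```

With the highlight from the output above, color_to_hex([1, 0.81961, 0]) gives the same #FFD100 that the xfdf file reports, which is a handy cross-check between the two representations.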

For a more thorough explanation of the various fields you get from getObject() in the given python code labeled as line (1), please consult https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf, especially section 12.5 Annotations on pages 381–413.
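Putting this together into the function the question asks for might look roughly like the sketch below. It is written against anything dictionary-like, so it can be tried on plain dicts shaped like the output above. The names note_texts and _resolve are made up here, and the assumption is that the library's objects expose a get_object() method for indirect references (getObject() in older PyPDF2 releases, so adjust accordingly).

```python
def _resolve(obj):
    """Follow an indirect reference if the object exposes get_object()."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def note_texts(pages):
    """Collect the /Contents text of every non-popup annotation.

    `pages` is any iterable of dict-like page objects (assumption: the
    reader's page objects behave like dictionaries, as in the script above).
    """
    texts = []
    for page in pages:
        annots = _resolve(page.get("/Annots")) or []
        for ref in annots:
            annot = _resolve(ref)
            if annot.get("/Subtype") == "/Popup":
                continue  # popups only describe the window, not the note
            contents = _resolve(annot.get("/Contents"))
            if contents:
                texts.append(str(contents))
    return texts
```

With a recent pypdf this would presumably be called as note_texts(PdfReader("/path/to/my/file.pdf").pages); that call is untested here and depends on the installed version.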

Can I add back the annotations to a PDF file (the same original file, but stripped of annotations)? — Yan King Yin, Mar 04 '21 at 08:28
Sorry @Shayan for a late answer, maybe others may find it useful at least as you've likely figured out yourself by now, it is the line `print annot.getObject()`, you can simply write `comment = annot.getObject()` instead and `comment` will contain a `dict` as in the lines starting with `{'/Popup': IndirectObject(192, 0), ...` above. If you need to convert the `IndirectObject` into something usable, you can use `comment['/Popup'] = list(comment['/Popup'])` (or `tuple` instead of `list`), then `comment` will look like `{'/Popup': [192, 0], ...` — mxl, Oct 08 '21 at 17:25
The current link to the PDF 1.7 specification (ISO 32000-1:2008) is broken, but [this link](https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf) works. (This is also the link the Library of Congress keeps in [its entry for PDF](https://www.loc.gov/preservation/digital/formats/fdd/fdd000277.shtml).) — Jonathan Jeffrey, Sep 13 '22 at 20:06
Also, as of 2023, PyPDF2 [is simply known as pypdf](https://pypdf.readthedocs.io/en/latest/meta/history.html). — Jonathan Jeffrey, Jan 27 '23 at 11:06

Enno Gröper · Answer 2 · 2021-01-25T15:41:20.500

24

Just in case somebody is looking for some working code. Here is a script I use.

import poppler
import sys
import urllib
import os


def main():
    input_filename = sys.argv[1]
    # http://blog.hartwork.org/?p=612
    # poppler wants a file:// URI, not a plain path
    # (urllib.pathname2url is the Python 2 location; Python 3 moved it to urllib.request)
    document = poppler.document_new_from_file(
        'file://%s' % urllib.pathname2url(os.path.abspath(input_filename)), None)
    n_pages = document.get_n_pages()
    all_annots = 0

    for i in range(n_pages):
        page = document.get_page(i)
        annot_mappings = page.get_annot_mapping()
        for annot_mapping in annot_mappings:
            annot = annot_mapping.annot
            # skip hyperlinks; they are annotations too, but not comments
            if annot.get_annot_type().value_name != 'POPPLER_ANNOT_LINK':
                all_annots += 1
                print('page: {0:3}, {1:10}, type: {2:10}, content: {3}'.format(
                    i + 1, annot.get_modified(),
                    annot.get_annot_type().value_nick, annot.get_contents()))

    if all_annots > 0:
        print(str(all_annots) + " annotation(s) found")
    else:
        print("no annotations found")


if __name__ == "__main__":
    main()

edited Jan 25 '21 at 15:41

answered Sep 19 '12 at 20:40


Probably worth chucking that up on a public git repo somewhere, so others can easily help to improve it. – naught101 Aug 29 '17 at 03:09
Also, how are you installing Poppler? – naught101 Aug 29 '17 at 03:15
I assume you are using linux, aren't you? In windows, it is hard to come by poppler python bindings. – schlingel Nov 15 '17 at 11:33
@schlingel Yes. I'm using Linux. – Enno Gröper Nov 15 '17 at 14:08
I installed python-poppler from apt and poppler from anaconda. Still I get this error: `ImportError: No module named poppler`. How do I install Poppler? – Jul 23 '18 at 21:37
Is the poppler python module now called [python-poppler-qt5](https://pypi.org/project/python-poppler-qt5/)? – Paul Rougieux Dec 17 '19 at 09:36
`poppler` installation for python2: `apt-get install python-poppler`. `poppler` installation for python3: `sudo apt-get install python3-poppler-qt4`. reference: https://stackoverflow.com/questions/39522374/how-to-install-poppler-for-python-3-in-linux – Walty Yeung Mar 01 '20 at 08:38
hi I installed using conda https://anaconda.org/conda-forge/poppler But when I do import poppler it says no module. even when I see it in conda env – Baktaawar Apr 28 '20 at 19:27
@Shayan If you are able to install python-poppler (Python bindings for poppler and of course poppler libraries themselves) on Windows, it should work there as well. But I didn't check if this is easy or possible. – Enno Gröper Oct 16 '20 at 10:08
@EnnoGröper Can you explain me as a beginner how to use your code? I'm with the terminal on the same folder as my pdf, I named the script "extract.py", then I do $python3 extract.py ComoEstoTambienEsMatematica.pdf and it outputs "python3 extract.py ComoEstoTambienEsMatematica.pdf File "extract.py", line 22 print 'page: {0:3}, {1:10}, type: {2:10}, content: {3}'.format(i+1,annot_mapping.annot.get_modified(), annot_mapping.annot.get_annot_type().value_nick,annot_mapping.annot.get_contents()) " – Santropedro Jan 20 '21 at 02:12
@Santropedro The code is from 2012, so it's for Python2. But I guess it works on Python3 when using the print function. I updated the example code. – Enno Gröper Jan 25 '21 at 15:42
Maybe the API changed? Unless I'm missing something I'm not seeing any public methods called/related to `get_annot_mapping` (https://cbrunet.net/python-poppler/api/poppler.page.html and https://poppler.freedesktop.org/api/cpp/classpoppler_1_1page.html is where I'm looking) – Tom Jun 14 '21 at 23:35
@EnnoGröper I think this answer needs to be updated, since that old poppler isn't available in Ubuntu 20 anymore, it was until 18, see https://packages.ubuntu.com/search?keywords=python-poppler&searchon=names&suite=all&section=all. but also not in other distros, see https://pkgs.org/search/?q=python-poppler, instead now I believe the replacement is https://packages.ubuntu.com/search?keywords=python3-poppler-qt5&searchon=names&suite=all&section=all ! If you could change your code to work with this new poppler qt5, I think just by changing one or two instructions, it might work. – Santropedro Sep 18 '21 at 23:34

joelostblom · Answer 3 · 2020-03-27T19:17:42.243

The pdfannots script can extract annotations from PDFs. It is built upon PDFMiner.six and produces output in markdown both for the highlighted text and any annotations made on it, such as comments on highlighted areas or popup boxes. The output would look similar to this:

 * Page 2 Highlight:
 > Underlying text that was highlighted

 Comment made on highlighted text.

 * Page 3 Highlight: "Short highlighted text" -- Short comment.

 * Page 4 Text: A note on the page.

The full command options can be seen below.

usage: pdfannots.py [-h] [-p] [-o OUTFILE] [-n COLS] [-s [SEC [SEC ...]]] [--no-group]
                    [--print-filename] [-w COLS]
                    INFILE [INFILE ...]

Extracts annotations from a PDF file in markdown format for use in reviewing.

positional arguments:
  INFILE                PDF files to process

optional arguments:
  -h, --help            show this help message and exit

Basic options:
  -p, --progress        emit progress information
  -o OUTFILE            output file (default is stdout)
  -n COLS, --cols COLS  number of columns per page in the document (default: 2)

Options controlling output format:
  -s [SEC [SEC ...]], --sections [SEC [SEC ...]]
                        sections to emit (default: highlights, comments, nits)
  --no-group            emit annotations in order, don't group into sections
  --print-filename      print the filename when it has annotations
  -w COLS, --wrap COLS  wrap text at this many output columns

I haven't tried this out extensively, but it has been working well so far!

Thanks for the tip! pdfannots is awesome! Super-easy to install (`pip install pdfannots`), and easy to use. Also appears to be actively maintained. — solarchemist, Aug 17 '21 at 03:06

score 5 · Accepted Answer · answered Jul 12 '09 at 20:57

Turns out the bindings were incomplete. It is now fixed. https://bugs.launchpad.net/poppler-python/+bug/397850

davidb


t-bltg · Answer 5 · 2022-10-14T09:01:44.107

5

Here is a working example (ported from previous answer) extracting annotations with the python module popplerqt5: python3 extract.py sample.pdf

extract.py

import popplerqt5
import argparse


def extract(fn):
    doc = popplerqt5.Poppler.Document.load(fn)
    annotations = []
    for i in range(doc.numPages()):
        page = doc.page(i)
        for annot in page.annotations():
            contents = annot.contents()
            if contents:
                annotations.append(contents)
                print(f'page={i + 1} {contents}')

    print(f'{len(annotations)} annotation(s) found')
    return annotations


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('fn')
    args = parser.parse_args()
    extract(args.fn)

edited Oct 14 '22 at 09:01

answered Jan 21 '20 at 00:46

t-bltg


Sometimes it says 0 annotations found but sometimes it finds the annotations but how do I extract them? – Shayan Oct 09 '20 at 12:21
Something like print(extract(args.fn)) – t-bltg Oct 29 '20 at 09:14

score 2 · Answer 6 · answered Jul 12 '20 at 09:08

The author of PyMuPDF, @JorjMcKie, wrote a snippet for me and I modified it a bit:

import fitz  # to import the PyMuPDF library
# from pprint import pprint


def _parse_highlight(annot: fitz.Annot, wordlist: list) -> str:
    points = annot.vertices
    quad_count = int(len(points) / 4)
    sentences = ['' for i in range(quad_count)]
    for i in range(quad_count):
        r = fitz.Quad(points[i * 4: i * 4 + 4]).rect
        words = [w for w in wordlist if fitz.Rect(w[:4]).intersects(r)]
        sentences[i] = ' '.join(w[4] for w in words)
    sentence = ' '.join(sentences)
    return sentence


def main() -> dict:
    doc = fitz.open('path/to/your/file')
    page = doc[0]

    wordlist = page.getText("words")  # list of words on page
    wordlist.sort(key=lambda w: (w[3], w[0]))  # ascending y, then x

    highlights = {}
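    annot = page.firstAnnot
    i = 0
    while annot:
        if annot.type[0] == 8:  # 8 is the Highlight annotation type
            highlights[i] = _parse_highlight(annot, wordlist)
            # print before incrementing i; highlights[i + 1] does not exist yet
            print('> ' + highlights[i] + '\n')
            i += 1
        annot = annot.next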

    # pprint(highlights)
    return highlights


if __name__ == "__main__":
    main()

https://github.com/pymupdf/PyMuPDF
Is it posible to extract highlighted text? #318 https://github.com/pymupdf/PyMuPDF/issues/318#issuecomment-657102559

Though there are still some small typos in the results:

> system upsets,

> expansion of smart grid monitoring devices that generally provide nodal voltages and power injections at ﬁne spatial resolution,

> hurricanes to indi- vidual lightning strikes),
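Glitches like these (the "ﬁ" ligature and the leftover line-break hyphen in "indi- vidual") can often be patched up after extraction. Here is a small sketch of one way to do it; clean_extracted is a made-up name and the hyphen rule is a heuristic, so it will also join legitimate constructs such as "pre- and post-processing".

```python
import re
import unicodedata


def clean_extracted(text):
    """Tidy text pulled out of a PDF (heuristic, not lossless)."""
    # NFKC folds compatibility characters, e.g. the single-glyph "ﬁ" ligature
    # becomes the two letters "fi"
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words split by a line-break hyphen: "indi- vidual" -> "individual"
    return re.sub(r"(?<=[a-z])- (?=[a-z])", "", text)
```

It could be applied to each sentence before it is stored, for example highlights[i] = clean_extracted(_parse_highlight(annot, wordlist)).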

score 2 · Answer 7 · answered Jun 16 '22 at 07:48

from typing import Dict, List

from pdfannots import process_file


def get_pdf_annots(pdf_filename) -> Dict[int, List[str]]:
    """
    Return example:
    {
        0: ["Human3.6M", "Our method"],
        3: [
            "pretrained using 3D mocap data"
        ],
    }
    """
    annots_dict = dict()
    document = process_file(open(pdf_filename, "rb"))
    for page_idx in range(len(document.pages)):
        annots = document.pages[page_idx].annots
        for annot in annots:
            if page_idx not in annots_dict:
                annots_dict[page_idx] = []

            text = "".join(annot.text).strip()
            # drop hyphenation at line breaks, then replace remaining newlines with spaces
            text = text.replace("-\n", "").replace("\n", " ")
            annots_dict[page_idx].append(text)
    return annots_dict

if __name__ == "__main__":
    print(get_pdf_annots("xxx.pdf"))


score 0 · Answer 8 · answered Jul 10 '09 at 05:50

I have never used this, nor have I needed this kind of feature, but I found PDFMiner - this link has information about basic usage. Maybe this is what you are looking for?

zeroDivisible


While that might be useful if I wanted to extract all of the text from a pdf, I just want to extract the annotations. The reason I mentioned poppler is because it does provide this ability rather easily (http://cgit.freedesktop.org/poppler/poppler/tree/glib/poppler-annot.h). But I wanted to use python. I found the python-poppler binding project, but it does not seem to provide full access to the annotations. My question kind of boils down to "Am I doing it wrong or is the library incomplete?" and "Are there any others that provide the same functionality?" – davidb Jul 10 '09 at 13:54

score 0 · Answer 9 · answered Jan 14 '18 at 22:25

Somebody asked a similar question. I tried the code sample there and it did not work for me until I made a few functional and cosmetic changes.

#!/usr/bin/ruby

require 'pdf-reader'

ARGV.each do |filename|
  PDF::Reader.open(filename) do |reader|
    puts "file: #{filename}"
    puts "page\tcomment"
    reader.pages.each do |page|
      annots_ref = page.attributes[:Annots]
      if annots_ref
        actual_annots = annots_ref.map { |a| reader.objects[a] }
        actual_annots.each do |actual_annot|
          unless actual_annot[:Contents].nil?
            puts "#{page.number}\t#{actual_annot[:Contents]}"
          end
        end
      end
    end       
  end
end

If saved as pdfannot.rb, chmod +x'ed and placed into your favourite PATH directory, usage is:

./pdfannot.rb <path>

First time writing/editing/remixing Ruby code, so very open for suggestions. HTH.

On a side note, finding this question earlier could have saved me from double work. Hopefully this question gets more attention in the future such that it is easier to find.

score 0 · Answer 10 · answered Sep 08 '22 at 09:20

I updated Enno Gröper's poppler script to work with python3 and poppler-qt5. I also extract specifically the annotations that are comment-like which I use in grading student assignments. I am further developing it as part of my teaching-tools project at https://github.com/foleyj2/teaching-tools under extract-pdf-comments.py

from popplerqt5 import Poppler
import sys
#import urllib ##might be useful for extracting from web documents
import os

SubTypes = ("BASE", #0 base class
            "Text", #1 Text callout (bubble)
            "Line", #2 strike out
            "Geometry", #3 geometric figure, like a rectangle or an ellipse. 
            "Highlight",#4 some areas of text being "highlighted"
            "Stamp", #5 drawing a stamp on a page
            "Ink", #6 ink path on a page
            "Link", #7 link to something else (internal or external)
            "Caret", #8 a symbol to indicate the presence of text. 
            "FileAttachment", #9 file embedded in the document
            "Sound", #10 sound to be played when activated.
            "Movie", #11 movie to be played when activated.
            "Screen", #12 screen to be played when activated.
            "Widget", #13 widget (form field) on a page
            "RichMedia" #14 video or sound on a page.
            )

def main():
  input_filename = sys.argv[1]
  document = Poppler.Document.load(input_filename)
  n_pages = document.numPages()

  for i in range(n_pages):
    page = document.page(i)
    print(f"Processing page {i+1}")
    for annotation in page.annotations():
      subtype_num = annotation.subType()
      subtype = SubTypes[subtype_num]
      #print(f"{subtype_num}={subtype}: {annotation.contents()}")

      ## For grading purposes, I only care about the Highlight and Text
      ## annotation subtypes
      if subtype in {"Text","Highlight"}:     
        print(f"Annotation suitable for grading: '{annotation.contents()}'")
                                  
    if len(page.annotations()) < 1:      
      print("no annotations found")

if __name__ == "__main__":
  main()
