How to extract highlighted text from a PDF?

Question

The solutions to this, this and this question show how to extract comments made in popup/sticky notes and on highlighted text areas. However, I have not found a way to extract the highlighted text itself. Is this possible using Python (or bash, Unix command-line tools, or another command-line solution that is straightforward and not too large to install)?

So far I have tried PyPDF2 and python-poppler-qt5. With PyPDF2, I can get the QuadPoints or Rect regions, which I suspect can be used to extract a region of text from the page, but I could not get p2.extractText() to work at all, so I am not sure how to proceed:

import PyPDF2 as pp2

pdf_file = open('sample.pdf', 'rb')
pdf = pp2.PdfFileReader(pdf_file)
p2 = pdf.getPage(2)
p2['/Annots'][4].getObject()

Out:

    {'/Type': '/Annot',
     '/Subtype': '/Highlight',
     '/Subj':'Highlight',
     '/T': 'joel',
     '/F': 4,
     '/NM': 'b3a6d3a3-bdab-457b-a769-9d82e616798a',
     '/CreationDate': 'D:20190704104139',
     '/CA': 1,
     '/Rect': [58.9415, 184.958, 550.855, 235.086],
     '/C': [0.99608, 0.99608, 0.68235],
     '/QuadPoints': [430.769,
      219.662,
      550.304,
      219.662,
      430.769,
      235.086,
      550.304,
      235.086,
      58.9415,
      202.585,
      531.575,
      202.585,
      58.9415,
      218.56,
      531.575,
      218.56,
      59.4924,
      185.509,
      313.437,
      185.509,
      59.4924,
      201.483,
      313.437,
      201.483],
     '/AP': {'/N': {'/Subtype': '/Form',
       '/FormType': 1,
       '/BBox': [0, 0, 491.914, 50.1278],
       '/Resources': {'/ExtGState': {'/TransGs': {'/CA': 1,
          '/ca': 1,
          '/BM': '/Multiply',
          '/Type': '/ExtGState'}}},
       '/Group': {'/S': '/Transparency', '/Type': '/Group'},
       '/Filter': '/FlateDecode'}},
     '/M': "D:20190704104139+00'00'",
     '/Contents': 'sample text written on highlighted area'}
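To make the QuadPoints idea concrete, here is a minimal sketch that groups the flat list into one bounding box per highlighted line. It assumes the usual layout of eight numbers (four x, y corner pairs) per quadrilateral, in PDF user space with the origin at the bottom-left of the page. The helper names `quadpoints_to_rects` and `to_top_left_origin` are made up for illustration and are not part of PyPDF2. The boxes would still have to be handed to a tool that can extract text from a region.

```python
from typing import List, Sequence, Tuple

# (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1
Rect = Tuple[float, float, float, float]


def quadpoints_to_rects(quad_points: Sequence[float]) -> List[Rect]:
    """Hypothetical helper: turn a flat /QuadPoints list into bounding boxes.

    Assumes 8 numbers per quadrilateral (four x, y corner pairs).
    """
    if len(quad_points) % 8 != 0:
        raise ValueError("expected a multiple of 8 numbers in /QuadPoints")
    rects = []
    for i in range(0, len(quad_points), 8):
        quad = quad_points[i:i + 8]
        xs = [float(v) for v in quad[0::2]]  # x of each corner
        ys = [float(v) for v in quad[1::2]]  # y of each corner
        # Taking min/max avoids depending on the order of the corners.
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def to_top_left_origin(rect: Rect, page_height: float) -> Rect:
    """Flip a bottom-left-origin box for tools that measure y from the top."""
    x0, y0, x1, y1 = rect
    return (x0, page_height - y1, x1, page_height - y0)


# The /QuadPoints values from the annotation shown above.
quad_points = [
    430.769, 219.662, 550.304, 219.662, 430.769, 235.086, 550.304, 235.086,
    58.9415, 202.585, 531.575, 202.585, 58.9415, 218.56, 531.575, 218.56,
    59.4924, 185.509, 313.437, 185.509, 59.4924, 201.483, 313.437, 201.483,
]
rects = quadpoints_to_rects(quad_points)
for rect in rects:
    print(rect)
```

For this annotation the result is three boxes, one per highlighted line, which together span the same area as the /Rect entry.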

With python-poppler-qt5, I can also get the boundary of the highlight area, but when I try to extract text from that region, an empty string is returned:

import popplerqt5
d = popplerqt5.Poppler.Document.load('sample.pdf')
p2 = d[2]
a = p2.annotations()
t = a[4]
p2.text(t.boundary())  # Returns an empty string.
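One possible explanation for the empty string, offered as an assumption and not verified against sample.pdf: Poppler's Qt bindings document the annotation boundary as normalized coordinates (0 to 1 across the page), while the text lookup expects a rectangle in points. If that is the cause, the boundary needs to be scaled by the page size first. The sketch below shows only the arithmetic, in plain Python; `scale_normalized_rect` is a hypothetical helper, not a Poppler function.

```python
from typing import Tuple

# (left, top, right, bottom)
Rect = Tuple[float, float, float, float]


def scale_normalized_rect(norm_rect: Rect, page_width: float,
                          page_height: float) -> Rect:
    """Hypothetical helper: scale a 0..1 rectangle to page points."""
    left, top, right, bottom = norm_rect
    return (left * page_width, top * page_height,
            right * page_width, bottom * page_height)


# Illustrative numbers only, not taken from sample.pdf.
scaled = scale_normalized_rect((0.25, 0.5, 0.75, 1.0), 600.0, 800.0)
print(scaled)

# With python-poppler-qt5 the inputs would presumably come from
# t.boundary() and p2.pageSizeF(), and the scaled values would be put
# back into a QRectF before calling p2.text(); that wiring is untested here.
```

Since a highlight can span several lines, scaling each quadrilateral separately would likely give cleaner text than scaling the single boundary, which also covers unhighlighted parts of the first and last lines.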

My current solution is to use Zotero with ZotFile, which does exactly this (via pdf.js), but it is tedious when I have multiple PDFs, so I would like to automate the process if possible.

I posted an answer here https://stackoverflow.com/a/59959625/2166823 — joelostblom, Jan 29 '20 at 02:06
