I would like to use Python to extract highlights, text boxes, and text box colors from PDFs.

I am having trouble installing poppler, which is mentioned in the related question "Extracting PDF annotations/comments".

I also couldn't work out how to make the fitz package (PyMuPDF) extract the highlighted text (https://stackoverflow.com/a/61761129/1273751). My PDF has highlights on pages 3, 4, and 14, and text boxes on pages 4 and 14. This is what I tried:

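import fitz  # PyMuPDF
doc = fitz.open("example.pdf")
# Iterating over the document avoids the pageCount attribute, which newer PyMuPDF releases renamed to page_count
for i, page in enumerate(doc):
    for annot in page.annots():
        # info["content"] is the annotation's own note text, not the page text underneath it
        print(i, "||", annot.info["content"], "||", annot.colors, "||", annot.type)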

Output was:

3 ||  || {'stroke': [0.4156799912452698, 0.8509830236434937, 0.15685999393463135], 'fill': []} || (8, 'Highlight')
4 ||  || {'stroke': [0.9843140244483948, 0.5333399772644043, 1.0], 'fill': []} || (8, 'Highlight')
4 ||  || {'stroke': [0.4156799912452698, 0.8509830236434937, 0.15685999393463135], 'fill': []} || (8, 'Highlight')
4 ||  || {'stroke': [1.0, 0.8196110129356384, 0.0], 'fill': []} || (8, 'Highlight')
4 || how does it allow that? || {'stroke': [0.9882349967956543, 0.9568629860877991, 0.5215759873390198], 'fill': []} || (2, 'FreeText')
14 || what's ensemble accuracy? || {'stroke': [0.9882349967956543, 0.9568629860877991, 0.5215759873390198], 'fill': []} || (2, 'FreeText')
14 ||  || {'stroke': [1.0, 0.8196110129356384, 0.0], 'fill': []} || (8, 'Highlight')

For the highlights, it gives me the color of the highlight, but not the actual text that was highlighted.

It works well for the text boxes, though: their content is printed as expected.
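
As a side note on the colors: the stroke values in the output are three floats between 0 and 1, which I read as RGB. Here is a minimal sketch of turning them into the usual hex notation. The helper name rgb_to_hex is my own and not part of fitz, and it assumes the list has exactly three components.

```python
def rgb_to_hex(color):
    """Convert three 0-1 floats (as found in annot.colors) to '#rrggbb'."""
    if len(color) != 3:
        # An empty list means no color is set; other lengths are not RGB.
        return None
    return "#" + "".join(f"{round(c * 255):02x}" for c in color)


# Stroke color of one of the highlights in the output above.
print(rgb_to_hex([1.0, 0.8196110129356384, 0.0]))  # -> #ffd100
```

Calling it on annot.colors["stroke"] inside the loop would print a hex color per annotation, and None for an empty 'fill' list.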

Answer to another related question: https://stackoverflow.com/a/65631205/1273751

Thank you!

Homero Esmeraldo

1 Answer

With fitz, I managed to extract the text box contents (see the FreeText rows in the output in my question).

I have since found an answer that uses fitz to extract the highlighted text itself: https://stackoverflow.com/a/63686095/1273751

(I have not tested it.)
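
For reference, here is a rough sketch of one way this can be done: a highlight annotation stores the corners of the highlighted areas in annot.vertices (four points per area), so the words on the page can be matched against those areas. This is not the code from the linked answer, and it is untested on my example PDF. The helper names quad_rects, words_in_rects, and highlighted_text are made up for illustration, and the sketch assumes a recent PyMuPDF where the method is spelled get_text (older releases used getText).

```python
def quad_rects(vertices):
    """Group a highlight's vertices (4 points per quad) into bounding boxes."""
    rects = []
    for k in range(0, len(vertices) - 3, 4):
        xs = [p[0] for p in vertices[k:k + 4]]
        ys = [p[1] for p in vertices[k:k + 4]]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def words_in_rects(words, rects):
    """Join the words whose center point falls inside any of the boxes."""
    picked = []
    for x0, y0, x1, y1, text, *_ in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        if any(rx0 <= cx <= rx1 and ry0 <= cy <= ry1
               for rx0, ry0, rx1, ry1 in rects):
            picked.append(text)
    return " ".join(picked)


def highlighted_text(path):
    """Yield (page_index, text) for each Highlight annotation in the PDF."""
    import fitz  # PyMuPDF; imported here so the helpers above run without it

    with fitz.open(path) as doc:
        for i, page in enumerate(doc):
            # Each entry is (x0, y0, x1, y1, word, block_no, line_no, word_no).
            words = page.get_text("words")
            for annot in page.annots():
                if annot.type[1] == "Highlight" and annot.vertices:
                    yield i, words_in_rects(words, quad_rects(annot.vertices))
```

It could be used like the loop in the question, for example: for i, text in highlighted_text("example.pdf"): print(i, "||", text). Matching on the word center is a simplification, so a word that is only partly highlighted may be included or dropped depending on where the highlight ends.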

Homero Esmeraldo
  • Given that the linked SO question/answer answers your question, do you see any value in leaving your question and answer up? That is, does your Q/A _add_ something that doesn't exist in the other Q/A? – Zach Young May 20 '22 at 17:52
  • It's debatable. But I believe having the outputs that I put in my question is useful. And also that I am talking here about both highlights and text boxes. I would leave it, but I can be convinced otherwise. Not sure how to reason about what to do in this case. What do you think? The alternative would be just deleting? Mark as duplicate? What is it? – Homero Esmeraldo May 20 '22 at 22:57