How to extract highlights and text box contents from pdf in python?

Question

I would like to use Python to extract highlights, text boxes, and text box colors from PDFs.

I am having trouble installing poppler, which is mentioned in the related question "Extracting PDF annotations/comments".

I also couldn't find how to make the fitz package extract the highlighted text (https://stackoverflow.com/a/61761129/1273751). My PDF has highlights on pages 3, 4, and 14, and text boxes on pages 4 and 14.

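import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
# doc.pageCount was renamed to doc.page_count in newer PyMuPDF releases
for i in range(doc.page_count):
    page = doc[i]
    # page.annots() yields every annotation on the page
    for annot in page.annots():
        print(i, "||", annot.info["content"], "||", annot.colors, "||", annot.type)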

Output was:

3 ||  || {'stroke': [0.4156799912452698, 0.8509830236434937, 0.15685999393463135], 'fill': []} || (8, 'Highlight')
4 ||  || {'stroke': [0.9843140244483948, 0.5333399772644043, 1.0], 'fill': []} || (8, 'Highlight')
4 ||  || {'stroke': [0.4156799912452698, 0.8509830236434937, 0.15685999393463135], 'fill': []} || (8, 'Highlight')
4 ||  || {'stroke': [1.0, 0.8196110129356384, 0.0], 'fill': []} || (8, 'Highlight')
4 || how does it allow that? || {'stroke': [0.9882349967956543, 0.9568629860877991, 0.5215759873390198], 'fill': []} || (2, 'FreeText')
14 || what's ensemble accuracy? || {'stroke': [0.9882349967956543, 0.9568629860877991, 0.5215759873390198], 'fill': []} || (2, 'FreeText')
14 ||  || {'stroke': [1.0, 0.8196110129356384, 0.0], 'fill': []} || (8, 'Highlight')

For the highlights, it gives me the color of the highlight, but not the actual text that was highlighted.

It works well for the text box, though.
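
As an aside, the color values printed above are RGB components in the range 0 to 1. Here is a minimal sketch for turning such a list into a hex string, assuming three-component RGB lists like those in the output. The helper name `rgb_to_hex` is hypothetical and not part of fitz, and which entry ('stroke' or 'fill') matches the visible box color may depend on the annotation type and on the tool that created it.

```python
def rgb_to_hex(components):
    """Convert a list of 0..1 RGB floats to '#rrggbb'; return None if it is not RGB."""
    if len(components) != 3:
        return None  # e.g. the empty 'fill' list in the output above
    # Scale each component to 0..255 and format it as two hex digits
    return "#" + "".join(f"{round(c * 255):02x}" for c in components)


# 'stroke' value of the FreeText annotations from the output above
textbox_stroke = [0.9882349967956543, 0.9568629860877991, 0.5215759873390198]
print(rgb_to_hex(textbox_stroke))  # prints "#fcf485"
```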

Answer to another related question: https://stackoverflow.com/a/65631205/1273751

Thank you!

It looks like that has nicely extracted highlights and highlight text. Isn't that what you wanted? — Tim Roberts, May 20 '22 at 03:19
I edited it to become clearer about what is missing: the text content of what was highlighted — Homero Esmeraldo, May 20 '22 at 03:26

score 0 · Answer 1 · answered May 20 '22 at 03:40

With fitz, I managed to extract text box contents.

I have now also found an answer that uses fitz to extract the highlighted text: https://stackoverflow.com/a/63686095/1273751

(I have not tested it)
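
For reference, here is a minimal sketch of one way to get the highlighted text with fitz, untested against the PDF from the question. It assumes that `annot.vertices` holds four points per highlighted quad and that `page.get_text("words")` returns tuples starting with `(x0, y0, x1, y1, word, ...)`. The helper names `quads_to_rects`, `words_in_rects`, and `highlighted_text`, and the 0.5 overlap threshold, are assumptions made for illustration and are not part of fitz.

```python
def quads_to_rects(vertices):
    """Turn a flat list of (x, y) points, four per quad, into bounding boxes."""
    rects = []
    for k in range(0, len(vertices) - 3, 4):
        xs = [p[0] for p in vertices[k:k + 4]]
        ys = [p[1] for p in vertices[k:k + 4]]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def words_in_rects(words, rects, min_overlap=0.5):
    """Join the words whose box is covered by some rect by at least min_overlap."""
    picked = []
    for x0, y0, x1, y1, text, *_ in words:
        area = (x1 - x0) * (y1 - y0)
        if area <= 0:
            continue
        for rx0, ry0, rx1, ry1 in rects:
            w = min(x1, rx1) - max(x0, rx0)
            h = min(y1, ry1) - max(y0, ry0)
            if w > 0 and h > 0 and (w * h) / area >= min_overlap:
                picked.append(text)
                break
    return " ".join(picked)


def highlighted_text(path):
    """Return (page index, highlighted text) pairs for every Highlight annotation."""
    import fitz  # PyMuPDF, imported here so the helpers above work without it

    results = []
    with fitz.open(path) as doc:
        for page in doc:
            words = page.get_text("words")
            for annot in page.annots():
                if annot.type[1] != "Highlight":
                    continue
                rects = quads_to_rects(annot.vertices or [])
                results.append((page.number, words_in_rects(words, rects)))
    return results
```

Calling `highlighted_text("example.pdf")` should give one `(page index, text)` pair per highlight. The overlap threshold is there to keep out words from neighbouring lines that the highlight box only grazes, and it may need tuning for a given PDF.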

Homero Esmeraldo


Given that the linked SO question/answer answers your question, do you see any value in leaving your question and answer up? That is, does your Q/A _add_ something that doesn't exist in the other Q/A? – Zach Young May 20 '22 at 17:52
It's debatable. But I believe having the outputs that I put in my question is useful. And also that I am talking here about both highlights and text boxes. I would leave it, but I can be convinced otherwise. Not sure how to reason about what to do in this case. What do you think? The alternative would be just deleting? Mark as duplicate? What is it? – Homero Esmeraldo May 20 '22 at 22:57
