I would like to use Python to extract highlights, text boxes, and text box colors from PDFs.
I am having trouble installing poppler, which is mentioned in the related question "Extracting PDF annotations/comments".
I also couldn't figure out how to make the fitz package (https://stackoverflow.com/a/61761129/1273751) extract the highlighted text. I tried it on a PDF with highlights on pages 3, 4, and 14, and text boxes on pages 4 and 14:
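import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
# len(doc) gives the number of pages
for i in range(len(doc)):
    page = doc[i]
    # print content, colors, and type of every annotation on the page
    for annot in page.annots():
        print(i, "||", annot.info["content"], "||", annot.colors, "||", annot.type)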
Output was:
3 || || {'stroke': [0.4156799912452698, 0.8509830236434937, 0.15685999393463135], 'fill': []} || (8, 'Highlight')
4 || || {'stroke': [0.9843140244483948, 0.5333399772644043, 1.0], 'fill': []} || (8, 'Highlight')
4 || || {'stroke': [0.4156799912452698, 0.8509830236434937, 0.15685999393463135], 'fill': []} || (8, 'Highlight')
4 || || {'stroke': [1.0, 0.8196110129356384, 0.0], 'fill': []} || (8, 'Highlight')
4 || how does it allow that? || {'stroke': [0.9882349967956543, 0.9568629860877991, 0.5215759873390198], 'fill': []} || (2, 'FreeText')
14 || what's ensemble accuracy? || {'stroke': [0.9882349967956543, 0.9568629860877991, 0.5215759873390198], 'fill': []} || (2, 'FreeText')
14 || || {'stroke': [1.0, 0.8196110129356384, 0.0], 'fill': []} || (8, 'Highlight')
For the highlights, it gives me the color of the highlight, but not the actual text that was highlighted.
It works well for the text boxes, though.
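One direction that might recover the highlighted text is to read the page text lying under each highlight's area. The following is only a minimal sketch under assumptions, not a confirmed solution: it assumes a recent PyMuPDF in which `annot.vertices` returns four corner points per highlighted line and `page.get_text` accepts a `clip` rectangle. The helper names `quads_to_rects` and `highlighted_text` are made up for illustration.

```python
def quads_to_rects(vertices):
    """Turn a flat list of corner points (4 per highlighted line)
    into bounding boxes of the form (x0, y0, x1, y1)."""
    rects = []
    for k in range(0, len(vertices) - 3, 4):
        quad = vertices[k:k + 4]
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def highlighted_text(page, annot):
    """Return the page text found under one highlight annotation.

    Assumption: annot.vertices holds the highlight's corner points.
    """
    import fitz  # PyMuPDF; imported here so quads_to_rects works without it

    parts = []
    for rect in quads_to_rects(annot.vertices or []):
        # Restrict text extraction to the area covered by this line of the highlight
        parts.append(page.get_text("text", clip=fitz.Rect(rect)).strip())
    return " ".join(parts)
```

In the loop above, this could be called as `highlighted_text(page, annot)` whenever `annot.type[0] == 8` (the 'Highlight' type shown in the output). Because the clipping is purely geometric, the result may include or miss characters at the edges of a highlight.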
Answer to another related question: https://stackoverflow.com/a/65631205/1273751
Thank you!