PDFBox - Extract highlighted text that was set with Microsoft Word

Asked Jan 12 '23 at 12:32

Active Jan 12 '23 at 20:58

Viewed 93 times

I have a PDF file that contains some plain text and some highlighted text. The highlighting was not applied with a PDF editor but in Microsoft Word (the file is a Word document exported with "Print to PDF").

I would like to detect and extract the highlighted text and, if possible, the highlight color.

I searched and tried several code samples, with these results:

I can extract the highlighted text if it was highlighted with a PDF editor.
I can't extract the highlighted text if it was highlighted in Word.
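
A possible explanation for this difference, stated as an assumption rather than something verified against the sample file: a PDF editor stores a highlight as an annotation object, while Word's export usually paints a colored rectangle directly in the page content stream, so code that only lists annotations finds nothing. The sketch below scans a simplified content stream for filled rectangles and their fill color. The class `FilledRectScanner` and its whitespace tokenizer are hypothetical and handle only a few operators (`rg`, `g`, `re`, the fill operators, `q`/`Q`); it ignores the transformation matrix, strings, and other color spaces. With PDFBox, the same information would more realistically come from a `PDFGraphicsStreamEngine` subclass through its `appendRectangle` and `fillPath` callbacks.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper: collects filled rectangles from a simplified content stream. */
final class FilledRectScanner {

    /** A rectangle painted with a fill operator, plus its RGB fill color (0..1). */
    record FilledRect(double x, double y, double width, double height,
                      double r, double g, double b) {}

    static List<FilledRect> scan(String contentStream) {
        List<FilledRect> filled = new ArrayList<>();
        List<double[]> pendingRects = new ArrayList<>();   // rectangles in the current path
        List<Double> operands = new ArrayList<>();         // numbers seen since the last operator
        Deque<double[]> savedColors = new ArrayDeque<>();  // fill colors saved by q
        double[] fill = {0, 0, 0};                         // PDF default fill color is black

        for (String token : contentStream.trim().split("\\s+")) {
            try {
                operands.add(Double.parseDouble(token));
                continue;                                  // numeric operand, keep collecting
            } catch (NumberFormatException notANumber) {
                // not a number: treat the token as an operator
            }
            int n = operands.size();
            switch (token) {
                case "rg" -> {                             // set RGB fill color
                    if (n >= 3) {
                        fill = new double[] {operands.get(n - 3), operands.get(n - 2), operands.get(n - 1)};
                    }
                }
                case "g" -> {                              // set gray fill color
                    if (n >= 1) {
                        double gray = operands.get(n - 1);
                        fill = new double[] {gray, gray, gray};
                    }
                }
                case "re" -> {                             // append rectangle x y w h
                    if (n >= 4) {
                        pendingRects.add(new double[] {
                            operands.get(n - 4), operands.get(n - 3), operands.get(n - 2), operands.get(n - 1)});
                    }
                }
                case "f", "F", "f*", "B", "B*", "b", "b*" -> {  // operators that fill the path
                    for (double[] rect : pendingRects) {
                        filled.add(new FilledRect(rect[0], rect[1], rect[2], rect[3], fill[0], fill[1], fill[2]));
                    }
                    pendingRects.clear();
                }
                case "n", "S", "s" -> pendingRects.clear();     // path ended without a fill
                case "q" -> savedColors.push(fill.clone());     // save graphics state (color only)
                case "Q" -> {                                   // restore graphics state
                    if (!savedColors.isEmpty()) {
                        fill = savedColors.pop();
                    }
                }
                default -> { }                                  // operator not modeled here
            }
            operands.clear();
        }
        return filled;
    }
}
```

For example, `FilledRectScanner.scan("q 1 1 0 rg 72 700 100 12 re f Q")` yields one yellow rectangle at (72, 700) that is 100 wide and 12 high. A rectangle used only for clipping (`re W n`) is discarded because the path ends without a fill.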

I would like to know if you have any idea how to extract highlighted text that was not set with a PDF editor.

I followed this thread:

Identifying the text based on the output in PDF using PDFBOX

I would like to get (if possible) a result like this:

H{FILL:RGB 0.102 0.101 0.095;}E{FILL:RGB 0.102 0.101 0.095;}L{FILL:RGB 0.102 0.101 0.095;}L{FILL:RGB 0.102 0.101 0.095;}O{FILL:RGB 0.102 0.101 0.095;}

But with the highlight color behind each character rather than the character's own fill color.
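
To make the desired output concrete, here is a minimal sketch of the matching step, assuming the filled rectangles (with their colors) and the per-character bounding boxes have already been extracted and are expressed in the same coordinate system. All names here (`HighlightMatcher`, `Box`, `FilledRect`, `Glyph`, the `HIGHLIGHT` tag) are hypothetical; with PDFBox the glyph boxes could plausibly be derived from `TextPosition` objects, but that part is not shown.

```java
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: pairs each glyph with the filled rectangle painted behind it. */
final class HighlightMatcher {

    /** Axis-aligned box; x and y are the lower-left corner. */
    record Box(double x, double y, double width, double height) {
        boolean contains(double px, double py) {
            return px >= x && px <= x + width && py >= y && py <= y + height;
        }
    }

    /** A filled rectangle and its RGB fill color (components in 0..1). */
    record FilledRect(Box box, double r, double g, double b) {}

    /** One extracted character and its bounding box. */
    record Glyph(String text, Box box) {}

    /** Returns the last rectangle (in paint order) covering the glyph's center, or null. */
    static FilledRect highlightOf(Glyph glyph, List<FilledRect> rectsInPaintOrder) {
        double cx = glyph.box().x() + glyph.box().width() / 2;
        double cy = glyph.box().y() + glyph.box().height() / 2;
        FilledRect match = null;
        for (FilledRect rect : rectsInPaintOrder) {
            if (rect.box().contains(cx, cy)) {
                match = rect;   // a later fill paints over an earlier one
            }
        }
        return match;
    }

    /** Formats glyphs as TEXT{HIGHLIGHT:RGB r g b;} or TEXT{HIGHLIGHT:NONE;}. */
    static String annotate(List<Glyph> glyphs, List<FilledRect> rectsInPaintOrder) {
        StringBuilder out = new StringBuilder();
        for (Glyph glyph : glyphs) {
            FilledRect rect = highlightOf(glyph, rectsInPaintOrder);
            out.append(glyph.text());
            if (rect == null) {
                out.append("{HIGHLIGHT:NONE;}");
            } else {
                out.append(String.format(Locale.ROOT, "{HIGHLIGHT:RGB %.3f %.3f %.3f;}",
                        rect.r(), rect.g(), rect.b()));
            }
        }
        return out.toString();
    }
}
```

Testing the glyph's center rather than its full box tolerates highlights that are slightly narrower than the text. A real page may also contain a full-page background fill or table shading, so very large or white rectangles would probably need to be filtered out before matching.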

I tried many things, but I can't find a way to determine which parts of my PDF are highlighted.

I have run out of ideas.

I hope someone can help me!

edited Jan 12 '23 at 20:58

macropod


asked Jan 12 '23 at 12:32

Treasm


Please share a representative example PDF. – mkl Jan 12 '23 at 13:30
https://uploadnow.io/f/BMfflX9 Here is an example PDF (I'm not allowed to share the real one because it contains confidential information). – Treasm Jan 12 '23 at 14:15
Thank you for your answers. I thought the art and the text would be connected. So it is not really possible, in Java, to detect which characters are highlighted? – Treasm Jan 13 '23 at 07:04
If anyone has any idea... I'm desperate :') – Treasm Jan 16 '23 at 07:35

0 Answers