I have a PDF file that contains some plain text and some highlighted text. The highlighting was not applied with a PDF editor but in Microsoft Word (the file is a Word document exported via "Print to PDF").
I would like to detect and extract the highlighted text and, if possible, its highlight color.
I searched and tried several pieces of code, and my results so far are:
- I can extract the highlighted text if it was highlighted with a PDF editor.
- I can't extract the highlighted text if it was highlighted in Word.
I would like to know if you have any idea how to extract highlighted text that was not created with a PDF editor.
I followed this thread:
Identifying the text based on the output in PDF using PDFBOX
I would like to get (if possible) a result like this:
H{FILL:RGB 0.102 0.101 0.095;}E{FILL:RGB 0.102 0.101 0.095;}L{FILL:RGB 0.102 0.101 0.095;}L{FILL:RGB 0.102 0.101 0.095;}O{FILL:RGB 0.102 0.101 0.095;}
But with the highlight color behind each character rather than the character's own text color.
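To make the target more concrete, here is a rough sketch of the output I am after. It rests on an assumption I have not been able to confirm for my file: that Word's highlighting ends up in the PDF as a filled rectangle drawn behind the text rather than as a highlight annotation. The `Glyph` and `Fill` types and the `{HIGHLIGHT:...}` tag are placeholders of my own, not PDFBox classes, and getting the character boxes and filled rectangles out of the PDF is exactly the part I am missing.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class HighlightSketch {

    /** Placeholder: one extracted character and its bounding box on the page. */
    static final class Glyph {
        final char ch;
        final Rectangle2D box;
        Glyph(char ch, Rectangle2D box) { this.ch = ch; this.box = box; }
    }

    /** Placeholder: one filled rectangle on the page and its RGB fill color. */
    static final class Fill {
        final Rectangle2D area;
        final float r, g, b;
        Fill(Rectangle2D area, float r, float g, float b) {
            this.area = area; this.r = r; this.g = g; this.b = b;
        }
    }

    /** Returns the last fill whose area contains the glyph's centre, or null if none does. */
    static Fill highlightOf(Glyph glyph, List<Fill> fills) {
        Fill match = null;
        double cx = glyph.box.getCenterX();
        double cy = glyph.box.getCenterY();
        for (Fill f : fills) {
            if (f.area.contains(cx, cy)) {
                match = f; // assumes fills are listed in drawing order, so later ones win
            }
        }
        return match;
    }

    /** Appends a highlight tag after every character that sits on a filled rectangle. */
    static String format(List<Glyph> glyphs, List<Fill> fills) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.ch);
            Fill f = highlightOf(g, fills);
            if (f != null) {
                sb.append(String.format(Locale.ROOT,
                        "{HIGHLIGHT:RGB %.3f %.3f %.3f;}", f.r, f.g, f.b));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: 'H' sits on a yellow rectangle, 'I' does not.
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('H', new Rectangle2D.Double(10, 10, 8, 10)));
        glyphs.add(new Glyph('I', new Rectangle2D.Double(30, 10, 4, 10)));
        List<Fill> fills = new ArrayList<>();
        fills.add(new Fill(new Rectangle2D.Double(8, 8, 14, 14), 1f, 1f, 0f));
        System.out.println(format(glyphs, fills));
    }
}
```

With these made-up values only the `H` gets a tag. In a real file the character boxes and the rectangles would of course have to be in the same coordinate system for the containment test to mean anything.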
I have tried many things, but I can't find a way to determine which parts of my PDF are highlighted.
I have run out of ideas.
I hope someone can help me!