I have a PDF file that contains some plain text and some highlighted text. The highlighting was not applied with a PDF editor but in Microsoft Word (the file is a Word document exported with "Print to PDF").

I would like to extract/detect the highlighted text and, if possible, its highlight color.

I searched and tried several pieces of code. My results so far:

  • I can extract the highlighted text if it was highlighted with a PDF editor
  • I can't extract the highlighted text if it was highlighted in Word

Does anyone have an idea how to extract highlighted text when the highlight was not applied with a PDF editor?

I followed this thread:

Identifying the text based on the output in PDF using PDFBOX

I would like to get (if possible) a result like this:

H{FILL:RGB 0.102 0.101 0.095;}E{FILL:RGB 0.102 0.101 0.095;}L{FILL:RGB 0.102 0.101 0.095;}L{FILL:RGB 0.102 0.101 0.095;}O{FILL:RGB 0.102 0.101 0.095;}

But with the highlight color behind each character instead of the character's own fill color.
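To make the target concrete, here is a minimal sketch of the matching step, under the assumption that Word's "Print to PDF" draws each highlight as an ordinary filled rectangle behind the text rather than as a highlight annotation. It also assumes the filled rectangles and the glyph boxes have already been collected from the page in the same coordinate system (for example with a PDFBox `PDFGraphicsStreamEngine` subclass for the fills and a `PDFTextStripper` subclass for the glyphs; that collection code is not shown). `HighlightMatcher`, `Box`, `Fill`, `Glyph`, and the `HIGHLIGHT` tag are hypothetical names made up for illustration, not PDFBox API.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: pairs each glyph with the filled
// rectangle (if any) that lies underneath its center point.
class HighlightMatcher {

    /** Axis-aligned box; fills and glyphs must share one coordinate system. */
    static final class Box {
        final double x, y, width, height;

        Box(double x, double y, double width, double height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        boolean contains(double px, double py) {
            return px >= x && px <= x + width && py >= y && py <= y + height;
        }
    }

    /** A filled rectangle and its non-stroking RGB color (components 0..1). */
    static final class Fill {
        final Box box;
        final float r, g, b;

        Fill(Box box, float r, float g, float b) {
            this.box = box;
            this.r = r;
            this.g = g;
            this.b = b;
        }
    }

    /** One extracted character and its bounding box. */
    static final class Glyph {
        final String text;
        final Box box;

        Glyph(String text, Box box) {
            this.text = text;
            this.box = box;
        }
    }

    /**
     * Appends a {HIGHLIGHT:RGB r g b;} tag after every glyph whose center is
     * covered by a fill. Fills are expected in drawing order, so the last
     * covering fill is the one visible behind the glyph.
     */
    static String annotate(List<Glyph> glyphs, List<Fill> fills) {
        StringBuilder out = new StringBuilder();
        for (Glyph glyph : glyphs) {
            out.append(glyph.text);
            double cx = glyph.box.x + glyph.box.width / 2;
            double cy = glyph.box.y + glyph.box.height / 2;
            Fill match = null;
            for (Fill fill : fills) {
                if (fill.box.contains(cx, cy)) {
                    match = fill; // a later fill paints over an earlier one
                }
            }
            if (match != null) {
                out.append(String.format(Locale.ROOT,
                        "{HIGHLIGHT:RGB %.3f %.3f %.3f;}", match.r, match.g, match.b));
            }
        }
        return out.toString();
    }
}
```

A page-sized white background rectangle would also count as a match here, so in practice fills that are white or much larger than a line of text would probably need to be filtered out before calling `annotate`.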

I have tried many things, but I can't find a way to determine which parts of my PDF are highlighted.

I have run out of ideas.

I hope someone can help me!

  • Please share a representative example PDF. – mkl Jan 12 '23 at 13:30
  • https://uploadnow.io/f/BMfflX9 Here is an example PDF (I'm not allowed to share the real one because it contains confidential information). – Treasm Jan 12 '23 at 14:15
  • Thank you for your answers. I thought the art and the text would be connected. So is it not really possible, in Java, to detect which characters are highlighted? – Treasm Jan 13 '23 at 07:04
  • If anyone has any idea... I'm desperate :') – Treasm Jan 16 '23 at 07:35
