
The first page of this PDF displays the following white decorated text on top of an image. enter image description here

When I use the PDFBox utility PrintImageLocations, this graphic is not extracted as an image; only the background image is extracted, without the white decorated text. When the PDF is converted to a Word document, the decorated text is extracted as a shape with modifiable properties such as fill color, border color, and much more.

Is it possible to extract that shape from the PDF, using PDFBox? How?

Mark Rotteveel
Orit
  • This is partially answered here: https://stackoverflow.com/questions/38931422/pdfbox-2-0-2-calling-of-pagedrawer-processpage-method-caught-exceptions – Tilman Hausherr Dec 19 '21 at 09:10
  • @TilmanHausherr Thanks! I have modified the LineCatcher sample and am now drawing the (flipped) shape on a Graphics2D object. How can I get the drawing (stroke and fill) colors from PDFGraphicsStreamEngine? – Orit Dec 19 '21 at 12:16
  • `getGraphicsState().getStrokingColor()` and `getGraphicsState().getNonStrokingColor()` – Tilman Hausherr Dec 19 '21 at 19:11

1 Answer


The simplest way to extract such graphics is to convert those that can be converted into Scalable Vector Graphics (SVG), as here. I had to change the colour from white to magenta, otherwise it would look like a snowscape.

enter image description here

I don't use PDFBox, so I can't say how easy that would be with it. I simply exported page 1 as SVG using

MuPDF\mutool.exe convert -o page1.svg -O no-reuse-images Xcel_Energy-AR2018.pdf 1

However, you will get the whole page as SVG output, including the lower text. Note the extra header text in the top left corner and the page number in the lower left corner, which were not visible behind the pixel graphics.

enter image description here

Note that everything is converted to SVG objects, including any conventional text and image pixels; there is no easier way to extract all the PostScript-printer-style moves and linetos. So yes, it is overkill, since the output needs parsing to get just the object of interest (more easily done in a GUI such as Inkscape, or in InDesign, where it was constructed). It is not a good methodology for shape recognition, since the x and y values are described as rectangles and will have positions and scalars that most likely vary from page to page; thus there are no constants other than the filled appearance. The filled object would best be "seen" by regenerating it as pixels for visual symbol recognition (much like OCR).
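
As a rough illustration of the parsing step mentioned above, here is a minimal Python sketch that keeps only the `<path>` elements with a given fill colour. It rests on several assumptions: that the exported page1.svg stores the fill as a plain `fill="#rrggbb"` attribute on each path, that the inline `SAMPLE` document (hypothetical, not real mutool output) resembles it, and that the helper name `paths_with_fill` is mine. It also ignores any `transform` on enclosing groups, so the returned path data may still need repositioning.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def paths_with_fill(svg_text, fill="#ffffff"):
    """Return the 'd' attribute of every <path> whose fill attribute equals `fill`.

    Assumption: the fill is written as a literal attribute on the path itself,
    not inherited from a parent group or set through a style rule.
    """
    root = ET.fromstring(svg_text)
    wanted = fill.lower()
    found = []
    # iter() walks the whole tree, so paths nested inside groups are included.
    for path in root.iter(f"{{{SVG_NS}}}path"):
        if path.get("fill", "").lower() == wanted:
            found.append(path.get("d", ""))
    return found


# Hypothetical stand-in for the exported page: one white shape, one coloured
# background rectangle, and a text element that should be ignored.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <g transform="matrix(1,0,0,1,0,0)">
    <path d="M 10 10 L 90 10 L 50 80 Z" fill="#ffffff"/>
    <path d="M 0 0 H 100 V 100 H 0 Z" fill="#00529b"/>
  </g>
  <text x="5" y="95">Destination 2050</text>
</svg>"""

if __name__ == "__main__":
    for d in paths_with_fill(SAMPLE):
        print(d)
```

For a real file, the text would be read from page1.svg instead of `SAMPLE`. Filtering on fill colour alone is fragile, for the reasons given above: it is the only property likely to stay constant from page to page.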

K J
  • 8,045
  • 3
  • 14
  • 36
  • Thanks for that! It seems your suggestion does not fully match my requirement. I do not want to extract the text (Destination 2050...) together with the shape at the top right corner. The text is extracted by another flow, into JSON objects. The shape is needed for further analysis. – Orit Dec 19 '21 at 13:37
  • Yes, we do that. Thanks! – Orit Dec 20 '21 at 12:02