I have a simple PDF generated by Apache FOP + jEuclid. It contains vector graphics for math formulas, plus some text.

Link to PDF: https://www.dropbox.com/s/w4ksnud78bu9oz5/test.pdf?dl=0

I would like to know the bounding box (x, y, width, height) of each vector graphic. I've tried this example: https://svn.apache.org/repos/asf/pdfbox/tags/2.0.24/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java, but it doesn't output any useful information, only this:

Processing page: 1

In Acrobat I can select the vector images in the Tags tree, and it highlights them.

My question: how can I determine the bounding boxes of vector images via the PDFBox API?

  • Maybe you can modify this solution? https://stackoverflow.com/questions/38931422/pdfbox-2-0-2-calling-of-pagedrawer-processpage-method-caught-exceptions – Tilman Hausherr Jul 29 '21 at 03:23
  • @TilmanHausherr thank you for the advice. I've tried it, but it looks like PDFGraphicsStreamEngine is too low-level for my purpose. It lets me determine coordinates for each glyph in the formulas, i.e. it catches the paths of each individual glyph. I can't tell where one group of paths ends and the next one starts... – Alexander Dyuzhev Jul 29 '21 at 21:42
  • You can catch "fillPath", that would help for this PDF. – Tilman Hausherr Jul 30 '21 at 02:58
  • fillPath is called 13 times, i.e. once for each glyph/character in both formulas. Actually, besides the bounding box I also need to know the alternate (or actual) text related to each formula. So maybe I have to use PDFMarkedContentExtractor. – Alexander Dyuzhev Jul 30 '21 at 07:00
  • Your PDF does not have alternate text for these vector graphics. You can have a look at it with PDFDebugger. – Tilman Hausherr Jul 30 '21 at 18:57
  • Hmm, that's very strange. I've checked the PDF in Acrobat (full version, not Reader), and the Tags tab shows alternate and actual texts. You can also find the text '/Alt (Math)' and '/ActualText ( – Alexander Dyuzhev Jul 31 '21 at 11:05
  • Oops, you're right, I searched in the content stream (which can also have this). What you found is at `Root/StructTreeRoot/K/[0]/K/[0]/K/[0]/K/[0]/K/[0]/K/[0]/K/[0]/ActualText` and `Root/StructTreeRoot/K/[0]/K/[0]/K/[0]/K/[0]/K/[0]/K/[2]/K/[0]/ActualText`. That's the structure tree, which I'm mostly clueless about :-( – Tilman Hausherr Jul 31 '21 at 11:11

1 Answer

As long as the figures in question are appropriately tagged (as they are in your example document), you can determine their bounding boxes with a class based on the PDFBox PDFGraphicsStreamEngine.

You can actually reuse the BoundingBoxFinder from this answer (itself based on the PDFGraphicsStreamEngine), which determines the bounding box of all content on a page; you merely have to collect the bounding box information marked content sequence by marked content sequence.

The following class does that by storing the bounding box information in a hierarchy of MarkedContent objects:

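import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;

// BoundingBoxFinder is the class from the answer referenced above. It is assumed
// to live in the same package; it provides the running `rectangle` and `add(...)`.
public class MarkedContentBoundingBoxFinder extends BoundingBoxFinder {
    public MarkedContentBoundingBoxFinder(PDPage page) {
        super(page);
        // The artificial root node is the bottom of the stack.
        contents.add(content);
    }

    @Override
    public void processPage(PDPage page) throws IOException {
        super.processPage(page);
        // Close the artificial root so it receives the box of the whole page.
        endMarkedContentSequence();
    }

    @Override
    public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
        // Bank everything drawn so far with the enclosing sequence ...
        MarkedContent current = contents.getLast();
        if (rectangle != null) {
            if (current.boundingBox != null)
                add(current.boundingBox);
            current.boundingBox = rectangle;
        }
        // ... and start the new sequence with an empty running box.
        rectangle = null;
        MarkedContent newContent = new MarkedContent(tag, properties);
        contents.addLast(newContent);
        current.children.add(newContent);

        super.beginMarkedContentSequence(tag, properties);
    }

    @Override
    public void endMarkedContentSequence() {
        MarkedContent current = contents.removeLast();
        if (rectangle != null) {
            // Merge content drawn since the last begin/end event into this sequence.
            if (current.boundingBox != null)
                add(current.boundingBox);
            current.boundingBox = (Rectangle2D) rectangle.clone();
        } else if (current.boundingBox != null)
            // Nothing new was drawn: hand the stored box up to the parent.
            rectangle = (Rectangle2D) current.boundingBox.clone();

        super.endMarkedContentSequence();
    }

    // One marked content sequence with its tag, properties, nested sequences, and box.
    public static class MarkedContent {
        public MarkedContent(COSName tag, COSDictionary properties) {
            this.tag = tag;
            this.properties = properties;
        }

        public final COSName tag;
        public final COSDictionary properties;
        public final List<MarkedContent> children = new ArrayList<>();
        public Rectangle2D boundingBox = null; // stays null if the sequence draws nothing
    }

    // Root of the hierarchy; its box covers all content of the page.
    public final MarkedContent content = new MarkedContent(COSName.DOCUMENT, null);
    // Stack of the currently open marked content sequences, innermost last.
    public final Deque<MarkedContent> contents = new ArrayDeque<>();
}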

(MarkedContentBoundingBoxFinder utility class)
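
To see why the class saves and resets the running rectangle at every begin/end event, here is a minimal sketch of the same bookkeeping without PDFBox. Everything in it is a hypothetical stand-in, not part of the answer's code: `NestedBoxDemo`, its `Node` class, and the `draw` method, which plays the role of the drawing callbacks that grow `rectangle` in BoundingBoxFinder. The coordinates are made up.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class NestedBoxDemo {
    // Hypothetical stand-in for MarkedContent, with a String tag instead of a COSName.
    static class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();
        Rectangle2D box; // null until something is drawn inside this node
        Node(String tag) { this.tag = tag; }
    }

    final Node root = new Node("Document");
    private final Deque<Node> stack = new ArrayDeque<>();
    private Rectangle2D running; // box of everything drawn since the last begin/end event

    NestedBoxDemo() { stack.add(root); }

    // Stand-in for the drawing callbacks: grow the running box.
    void draw(double x, double y, double w, double h) {
        Rectangle2D r = new Rectangle2D.Double(x, y, w, h);
        if (running == null) running = r; else running.add(r);
    }

    void begin(String tag) {
        Node current = stack.getLast();
        current.box = union(current.box, running); // bank content drawn so far
        running = null;                            // the child starts empty
        Node child = new Node(tag);
        current.children.add(child);
        stack.addLast(child);
    }

    void end() {
        Node current = stack.removeLast();
        current.box = union(current.box, running);
        // Hand the finished box up, so the parent's box will cover this node too.
        running = current.box == null ? null : (Rectangle2D) current.box.clone();
    }

    // Null-tolerant union that never modifies its arguments.
    static Rectangle2D union(Rectangle2D a, Rectangle2D b) {
        if (a == null) return b == null ? null : (Rectangle2D) b.clone();
        if (b == null) return a;
        Rectangle2D result = (Rectangle2D) a.clone();
        result.add(b);
        return result;
    }

    static void print(Node node, String prefix) {
        System.out.println(prefix + node.tag + " " + node.box);
        for (Node child : node.children) print(child, prefix + "  ");
    }

    public static void main(String[] args) {
        NestedBoxDemo demo = new NestedBoxDemo();
        demo.begin("Figure"); demo.draw(10, 10, 5, 5); demo.draw(20, 12, 5, 5); demo.end();
        demo.begin("P"); demo.draw(30, 11, 40, 4); demo.end();
        demo.end(); // closes the root, as processPage does in the real class
        print(demo.root, "");
    }
}
```

Each node ends up with the union of its own content and that of its children, so the root covers both the Figure and the P box.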

You can apply it to a PDPage pdPage like this:

MarkedContentBoundingBoxFinder boxFinder = new MarkedContentBoundingBoxFinder(pdPage);
boxFinder.processPage(pdPage);
MarkedContent markedContent = boxFinder.content;

(excerpt from DetermineBoundingBox helper method drawMarkedContentBoundingBoxes)

You can output the bounding boxes from that markedContent object like this:

void printMarkedContentBoundingBoxes(MarkedContent markedContent, String prefix) {
    StringBuilder builder = new StringBuilder();
    builder.append(prefix).append(markedContent.tag.getName());
    builder.append(' ').append(markedContent.boundingBox);
    System.out.println(builder.toString());
    for (MarkedContent child : markedContent.children)
        printMarkedContentBoundingBoxes(child, prefix + "  ");
}

(DetermineBoundingBox helper method)
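
If you only care about the figures, the same recursive walk can collect boxes instead of printing them. The sketch below is only an illustration: `FigureBoxCollector` and its `Node` class are hypothetical stand-ins that use a String tag, so the example runs without PDFBox. With the real MarkedContent class you would compare `markedContent.tag.getName()` against "Figure" instead.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class FigureBoxCollector {
    // Hypothetical stand-in for MarkedContent, with a String tag instead of a COSName.
    static class Node {
        final String tag;
        final Rectangle2D box;
        final List<Node> children = new ArrayList<>();
        Node(String tag, Rectangle2D box) { this.tag = tag; this.box = box; }
    }

    // Depth-first walk that keeps the boxes of all nodes carrying the wanted tag.
    static List<Rectangle2D> collect(Node node, String wantedTag, List<Rectangle2D> out) {
        if (wantedTag.equals(node.tag) && node.box != null)
            out.add(node.box);
        for (Node child : node.children)
            collect(child, wantedTag, out);
        return out;
    }

    public static void main(String[] args) {
        // Made-up tree with the same shape as the example document: Figure, P, Figure.
        Node root = new Node("Document", new Rectangle2D.Double(0, 0, 100, 10));
        root.children.add(new Node("Figure", new Rectangle2D.Double(0, 0, 30, 10)));
        root.children.add(new Node("P", new Rectangle2D.Double(35, 2, 30, 6)));
        root.children.add(new Node("Figure", new Rectangle2D.Double(70, 0, 30, 10)));

        for (Rectangle2D box : collect(root, "Figure", new ArrayList<>()))
            System.out.println(box);
    }
}
```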

For your example document you get:

Document java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=128.63946533203125,h=10.2509765625]
  Figure java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=44.6771240234375,h=10.2509765625]
  P java.awt.geom.Rectangle2D$Double[x=136.79600524902344,y=760.1184081963065,w=43.137100359018405,h=6.383056943803922]
  Figure java.awt.geom.Rectangle2D$Double[x=184.2926788330078,y=758.10498046875,w=34.70478820800781,h=10.2509765625]
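
These rectangles appear to be in PDF user space, where y grows upward from the bottom of the page, so `y` is the lower edge of each box. If you need (x, y, width, height) relative to the top-left corner instead, a conversion along the following lines may help. Treat it as a sketch under assumptions: `TopLeftBox` and `toTopLeft` are hypothetical names, and the page height of 841.89 pt (A4) is only a guess for the example document. In practice you would read the height from the page itself, and this sketch ignores page rotation and page boxes that do not start at the origin.

```java
import java.awt.geom.Rectangle2D;

class TopLeftBox {
    // Flip a box with a bottom-left origin (y up) into one with a top-left origin (y down).
    static Rectangle2D toTopLeft(Rectangle2D box, double pageHeight) {
        return new Rectangle2D.Double(
                box.getX(),
                pageHeight - box.getMaxY(), // distance from the page top to the box top
                box.getWidth(),
                box.getHeight());
    }

    public static void main(String[] args) {
        double assumedPageHeight = 841.89; // assumption: A4 page, not taken from the PDF
        // First Figure box from the output above.
        Rectangle2D figure = new Rectangle2D.Double(
                90.35800170898438, 758.10498046875, 44.6771240234375, 10.2509765625);
        System.out.println(toTopLeft(figure, assumedPageHeight));
    }
}
```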

Similarly, you can draw the bounding boxes on the PDF using the drawMarkedContentBoundingBoxes methods of DetermineBoundingBox. For your example document you get:

[Screenshot of the example page with the marked content bounding boxes drawn]

mkl