As long as the figures in question are appropriately tagged (as they are in your example document), you can determine their bounding boxes using the PDFBox PDFGraphicsStreamEngine.
You can actually reuse the BoundingBoxFinder from this answer (based on the PDFGraphicsStreamEngine), which determines the bounding box of all content of a page; you merely have to retrieve the bounding box information marked content sequence by marked content sequence.
The following class does that by storing bounding box information in a hierarchy of MarkedContent objects:
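
    import java.awt.geom.Rectangle2D;
    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    import org.apache.pdfbox.cos.COSDictionary;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDPage;

    // BoundingBoxFinder is the class from the linked answer; rectangle and
    // add(...) used below are inherited from it.
    public class MarkedContentBoundingBoxFinder extends BoundingBoxFinder {
        public MarkedContentBoundingBoxFinder(PDPage page) {
            super(page);
            // The artificial root element is the bottom of the stack.
            contents.add(content);
        }

        @Override
        public void processPage(PDPage page) throws IOException {
            super.processPage(page);
            // Close the artificial root element pushed in the constructor.
            endMarkedContentSequence();
        }

        @Override
        public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
            // Fold everything drawn so far into the enclosing element's box,
            // then start with an empty box for the new sequence.
            MarkedContent current = contents.getLast();
            if (rectangle != null) {
                if (current.boundingBox != null)
                    add(current.boundingBox);
                current.boundingBox = rectangle;
            }
            rectangle = null;
            MarkedContent newContent = new MarkedContent(tag, properties);
            contents.addLast(newContent);
            current.children.add(newContent);

            super.beginMarkedContentSequence(tag, properties);
        }

        @Override
        public void endMarkedContentSequence() {
            // Finalize the box of the sequence being closed; rectangle keeps
            // that box so it is folded into the parent later on.
            MarkedContent current = contents.removeLast();
            if (rectangle != null) {
                if (current.boundingBox != null)
                    add(current.boundingBox);
                current.boundingBox = (Rectangle2D) rectangle.clone();
            } else if (current.boundingBox != null)
                rectangle = (Rectangle2D) current.boundingBox.clone();

            super.endMarkedContentSequence();
        }

        public static class MarkedContent {
            public MarkedContent(COSName tag, COSDictionary properties) {
                this.tag = tag;
                this.properties = properties;
            }

            public final COSName tag;
            public final COSDictionary properties;
            public final List<MarkedContent> children = new ArrayList<>();
            public Rectangle2D boundingBox = null;
        }

        // Root of the hierarchy and the stack of currently open sequences.
        public final MarkedContent content = new MarkedContent(COSName.DOCUMENT, null);
        public final Deque<MarkedContent> contents = new ArrayDeque<>();
    }
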
(MarkedContentBoundingBoxFinder utility class)
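
To see how the begin/end bookkeeping builds the hierarchy, here is a minimal sketch of the same stack logic without PDFBox. It is only an illustration under stated assumptions: BoxStackSketch, Node, draw, begin, end and demo are hypothetical names that do not exist in PDFBox or in the class above, the tag is a plain String instead of a COSName, and draw stands in for whatever the graphics stream engine reports as drawn content.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical, PDFBox-free sketch of the bookkeeping in MarkedContentBoundingBoxFinder.
class BoxStackSketch {
    /** Stand-in for MarkedContent, using a String tag instead of a COSName. */
    static class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();
        Rectangle2D box;

        Node(String tag) {
            this.tag = tag;
        }
    }

    final Node root = new Node("Document");
    private final Deque<Node> stack = new ArrayDeque<>();
    // Box of the content seen since the last begin event (plays the role of "rectangle").
    private Rectangle2D running;

    BoxStackSketch() {
        stack.addLast(root);
    }

    /** Assumed callback for each piece of drawn content. */
    void draw(double x, double y, double w, double h) {
        Rectangle2D r = new Rectangle2D.Double(x, y, w, h);
        if (running == null) running = r; else running.add(r);
    }

    /** Mirrors beginMarkedContentSequence: flush into the parent, open a child. */
    void begin(String tag) {
        Node current = stack.getLast();
        if (running != null) {
            if (current.box != null) running.add(current.box);
            current.box = running;
        }
        running = null;
        Node child = new Node(tag);
        stack.addLast(child);
        current.children.add(child);
    }

    /** Mirrors endMarkedContentSequence: finalize the element being closed. */
    void end() {
        Node current = stack.removeLast();
        if (running != null) {
            if (current.box != null) running.add(current.box);
            current.box = (Rectangle2D) running.clone();
        } else if (current.box != null) {
            running = (Rectangle2D) current.box.clone();
        }
    }

    /** Two sibling sequences with made-up coordinates. */
    static BoxStackSketch demo() {
        BoxStackSketch s = new BoxStackSketch();
        s.begin("Figure");
        s.draw(0, 0, 10, 10);
        s.end();
        s.begin("P");
        s.draw(20, 0, 10, 5);
        s.end();
        s.end(); // close the root, as processPage does
        return s;
    }

    public static void main(String[] args) {
        BoxStackSketch s = demo();
        System.out.println(s.root.tag + " " + s.root.box);
        for (Node child : s.root.children)
            System.out.println(" " + child.tag + " " + child.box);
    }
}
```

Each element ends up with the union of its own content and that of its descendants, which is why the root box covers both siblings in this sketch.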
You can apply it to a PDPage pdPage like this:

    MarkedContentBoundingBoxFinder boxFinder = new MarkedContentBoundingBoxFinder(pdPage);
    boxFinder.processPage(pdPage);
    MarkedContent markedContent = boxFinder.content;

(excerpt from DetermineBoundingBox helper method drawMarkedContentBoundingBoxes)
You can output the bounding boxes from that markedContent object like this:

    void printMarkedContentBoundingBoxes(MarkedContent markedContent, String prefix) {
        StringBuilder builder = new StringBuilder();
        builder.append(prefix).append(markedContent.tag.getName());
        builder.append(' ').append(markedContent.boundingBox);
        System.out.println(builder.toString());
        for (MarkedContent child : markedContent.children)
            printMarkedContentBoundingBoxes(child, prefix + " ");
    }

(DetermineBoundingBox helper method)
For your example document you get:

    Document java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=128.63946533203125,h=10.2509765625]
     Figure java.awt.geom.Rectangle2D$Double[x=90.35800170898438,y=758.10498046875,w=44.6771240234375,h=10.2509765625]
     P java.awt.geom.Rectangle2D$Double[x=136.79600524902344,y=760.1184081963065,w=43.137100359018405,h=6.383056943803922]
     Figure java.awt.geom.Rectangle2D$Double[x=184.2926788330078,y=758.10498046875,w=34.70478820800781,h=10.2509765625]

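
If you only need the boxes of the figures, you can walk the hierarchy and keep the entries whose tag is Figure. The following is a rough sketch of that traversal, not part of the original helper classes: FigureBoxCollector, Node, collect and sample are hypothetical names, Node is a stand-in for MarkedContent with a String tag, and the sample values are rounded from the output above. Against the real class you would presumably compare markedContent.tag with COSName.getPDFName("Figure") instead of a String.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect the bounding boxes of all elements with a given tag.
class FigureBoxCollector {
    /** Stand-in for MarkedContentBoundingBoxFinder.MarkedContent. */
    static class Node {
        final String tag;
        final Rectangle2D box;
        final List<Node> children = new ArrayList<>();

        Node(String tag, Rectangle2D box) {
            this.tag = tag;
            this.box = box;
        }
    }

    /** Returns the boxes of all nodes in the tree whose tag matches, in document order. */
    static List<Rectangle2D> collect(Node node, String tag) {
        List<Rectangle2D> result = new ArrayList<>();
        collect(node, tag, result);
        return result;
    }

    private static void collect(Node node, String tag, List<Rectangle2D> result) {
        // Elements without any drawn content have no box and are skipped.
        if (tag.equals(node.tag) && node.box != null)
            result.add(node.box);
        for (Node child : node.children)
            collect(child, tag, result);
    }

    /** Sample tree with values rounded from the example output. */
    static Node sample() {
        Node root = new Node("Document", new Rectangle2D.Double(90.358, 758.105, 128.639, 10.251));
        root.children.add(new Node("Figure", new Rectangle2D.Double(90.358, 758.105, 44.677, 10.251)));
        root.children.add(new Node("P", new Rectangle2D.Double(136.796, 760.118, 43.137, 6.383)));
        root.children.add(new Node("Figure", new Rectangle2D.Double(184.293, 758.105, 34.705, 10.251)));
        return root;
    }

    public static void main(String[] args) {
        for (Rectangle2D box : collect(sample(), "Figure"))
            System.out.println(box);
    }
}
```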
Similarly, you can draw the bounding boxes on the PDF using the drawMarkedContentBoundingBoxes methods of DetermineBoundingBox. For your example document you get:
