
We're printing some PDFs from a Java desktop app, using PDFBox, and the PDFs contain too much whitespace (fixing the PDF generator is unfortunately not an option).

The problem I have is determining where the actual content on the page is, because the crop/media/trim/art/bleed boxes are useless. Is there some easy and efficient way to do so, better/faster than rendering the page to an image and examining which pixels stayed white?


xs0
  • Perhaps if you know enough about the contents/structure of your PDF files - in your illustration there is a background box, so maybe you can look for it. Otherwise, you may want to subclass [`PDFGraphicsStreamEngine`](https://pdfbox.apache.org/docs/2.0.11/javadocs/org/apache/pdfbox/contentstream/PDFGraphicsStreamEngine.html) to determine the desired dimensions without actually rendering to image. See e.g. [this example](https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/rendering/CustomGraphicsStreamEngine.java?view=markup) – Itai Oct 16 '18 at 08:19
  • Well, I'd like the solution to be able to handle any future PDF that comes its way, not just the current specific ones. – xs0 Oct 22 '18 at 09:30
  • What exactly is the content to keep in your case? E.g. should everything drawn be considered content? That might result in unexpectedly large bounding boxes as some applications start by painting the whole page in the background color. As you don't share your pdfs, it's hard to tell whether that's an issue with your pdfs. What about invisible drawings like text or vector graphics in background color? You want to be able to handle *"any future PDF"*... a generic solution may be beyond the scope of an answer here... can we at least assume a white background? – mkl Oct 22 '18 at 10:05
  • Yes, it can be assumed that there is no background or other elements that would need special handling. I can't share the exact nature of PDFs, but to a first approximation, it's just a relatively narrow column of text with some graphical elements (lines and QR codes) every now and then (their number/existence and position is not fixed). In other words, I'm looking for the axis-aligned minimum bounding box of all content on a page. – xs0 Oct 22 '18 at 15:46
  • Why you not use iText for generating PDFs? – user5377037 Oct 23 '18 at 12:24
  • @user5377037 *"Why you not use iText for generating PDFs?"* - How does that question help here? In particular as the solution of this issue is similarly difficult / easy with either library... – mkl Oct 23 '18 at 13:40

1 Answer

As you have mentioned in a comment that

"it can be assumed that there is no background or other elements that would need special handling",

I'll show a basic solution without any such special handling.

A basic bounding box finder

To find the bounding box without actually rendering to a bitmap and inspecting the bitmap pixels, one has to scan all the instructions of the content streams of the page and any XObjects referenced from there. One determines the bounding boxes of the stuff drawn by each instruction and eventually combines them to a single box.
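To illustrate the per-instruction step for the simplest case, a bitmap image: PDF always paints an image into the unit square, which the current transformation matrix then maps onto the page. The following is a minimal sketch using only `java.awt.geom` (no PDFBox); the class name `UnitSquareBounds` and the matrix values are made up for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

final class UnitSquareBounds {
    /** Axis-aligned bounding box of the unit square after applying the given matrix. */
    static Rectangle2D boundsOfUnitSquare(AffineTransform ctm) {
        Rectangle2D box = null;
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                Point2D corner = ctm.transform(new Point2D.Double(x, y), null);
                if (box == null) {
                    // Start from the first corner, not from (0, 0).
                    box = new Rectangle2D.Double(corner.getX(), corner.getY(), 0, 0);
                } else {
                    box.add(corner);
                }
            }
        }
        return box;
    }

    public static void main(String[] args) {
        // Hypothetical matrix: image scaled to 200 x 100 units, placed at (50, 700).
        AffineTransform ctm = new AffineTransform(200, 0, 0, 100, 50, 700);
        System.out.println(boundsOfUnitSquare(ctm));
    }
}
```

Transforming all four corners rather than only two opposite ones matters as soon as the matrix contains a rotation or skew.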

The simple box finder presented here combines them by simply returning the bounding box of their union.
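As a small standalone illustration of that union step, here is a sketch using only the JDK's `Rectangle2D`. The class name `BoxUnion` and the sample coordinates are hypothetical. Note that the result has to be seeded with the first box: calling `add` on a fresh, empty `Rectangle2D.Double` would silently pull the origin (0, 0) into the result.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

final class BoxUnion {
    /** Bounding box of the union of all boxes, or null if there are none. */
    static Rectangle2D union(List<Rectangle2D> boxes) {
        Rectangle2D result = null;
        for (Rectangle2D box : boxes) {
            if (result == null) {
                // Seed with the first box instead of an empty rectangle at (0, 0).
                result = new Rectangle2D.Double();
                result.setRect(box);
            } else {
                result.add(box);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up boxes: a line of text, a QR code, and a thin horizontal rule.
        List<Rectangle2D> boxes = Arrays.asList(
                new Rectangle2D.Double(72, 700, 120, 12),
                new Rectangle2D.Double(72, 500, 80, 80),
                new Rectangle2D.Double(72, 480, 200, 0.5));
        System.out.println(union(boxes));
    }
}
```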

For scanning the instructions of content streams, PDFBox offers a number of classes based on the PDFStreamEngine. The simple box finder is derived from the PDFGraphicsStreamEngine, which extends the PDFStreamEngine with some methods related to vector graphics.

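import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.Arrays;

import org.apache.fontbox.util.BoundingBox;
import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDCIDFontType2;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDSimpleFont;
import org.apache.pdfbox.pdmodel.font.PDTrueTypeFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;
import org.apache.pdfbox.pdmodel.font.PDType3CharProc;
import org.apache.pdfbox.pdmodel.font.PDType3Font;
import org.apache.pdfbox.pdmodel.font.PDVectorFont;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.Vector;

/**
 * Collects the union of the bounding boxes of all text, bitmap images and
 * painted paths on a page, without rendering the page.
 */
public class BoundingBoxFinder extends PDFGraphicsStreamEngine {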
    public BoundingBoxFinder(PDPage page) {
        super(page);
    }

    public Rectangle2D getBoundingBox() {
        return rectangle;
    }

    //
    // Text
    //
    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
            throws IOException {
        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
        Shape shape = calculateGlyphBounds(textRenderingMatrix, font, code);
        if (shape != null) {
            Rectangle2D rect = shape.getBounds2D();
            add(rect);
        }
    }

    /**
     * Copy of <code>org.apache.pdfbox.examples.util.DrawPrintTextLocations.calculateGlyphBounds(Matrix, PDFont, int)</code>.
     */
    private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
    {
        GeneralPath path = null;
        AffineTransform at = textRenderingMatrix.createAffineTransform();
        at.concatenate(font.getFontMatrix().createAffineTransform());
        if (font instanceof PDType3Font)
        {
            // It is difficult to calculate the real individual glyph bounds for type 3 fonts
            // because these are not vector fonts, the content stream could contain almost anything
            // that is found in page content streams.
            PDType3Font t3Font = (PDType3Font) font;
            PDType3CharProc charProc = t3Font.getCharProc(code);
            if (charProc != null)
            {
                BoundingBox fontBBox = t3Font.getBoundingBox();
                PDRectangle glyphBBox = charProc.getGlyphBBox();
                if (glyphBBox != null)
                {
                    // PDFBOX-3850: glyph bbox could be larger than the font bbox
                    glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                    glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                    glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                    glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                    path = glyphBBox.toGeneralPath();
                }
            }
        }
        else if (font instanceof PDVectorFont)
        {
            PDVectorFont vectorFont = (PDVectorFont) font;
            path = vectorFont.getPath(code);

            if (font instanceof PDTrueTypeFont)
            {
                PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
                int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
            if (font instanceof PDType0Font)
            {
                PDType0Font t0font = (PDType0Font) font;
                if (t0font.getDescendantFont() instanceof PDCIDFontType2)
                {
                    int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                    at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
                }
            }
        }
        else if (font instanceof PDSimpleFont)
        {
            PDSimpleFont simpleFont = (PDSimpleFont) font;

            // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
            // which is why PDVectorFont is tried first.
            String name = simpleFont.getEncoding().getName(code);
            path = simpleFont.getPath(name);
        }
        else
        {
            // shouldn't happen, please open issue in JIRA
            System.out.println("Unknown font class: " + font.getClass());
        }
        if (path == null)
        {
            return null;
        }
        return at.createTransformedShape(path.getBounds2D());
    }

    //
    // Bitmaps
    //
    @Override
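    public void drawImage(PDImage pdImage) throws IOException {
        // An image fills the unit square of the current user space, so its bounding
        // box is that of the four unit-square corners mapped by the CTM.
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();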
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                add(ctm.transformPoint(x, y));
            }
        }
    }

    //
    // Paths
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        addToPath(p0, p1, p2, p3);
    }

    @Override
    public void clip(int windingRule) throws IOException {
        // Clip paths are ignored by this simple finder.
    }

    @Override
    public void moveTo(float x, float y) throws IOException {
        addToPath(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException {
        addToPath(x, y);
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
        addToPath(x1, y1);
        addToPath(x2, y2);
        addToPath(x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException {
        return null;
    }

    @Override
    public void closePath() throws IOException {
    }

    @Override
    public void endPath() throws IOException {
        // The path was ended without being painted: discard its bounds.
        rectanglePath = null;
    }

    @Override
    public void strokePath() throws IOException {
        addPath();
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        addPath();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        addPath();
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException {
        // Shading fills are ignored by this simple finder.
    }

    void addToPath(Point2D... points) {
        Arrays.asList(points).forEach(p -> addToPath(p.getX(), p.getY()));
    }

    void addToPath(double newx, double newy) {
        if (rectanglePath == null) {
            rectanglePath = new Rectangle2D.Double(newx, newy, 0, 0);
        } else {
            rectanglePath.add(newx, newy);
        }
    }

    void addPath() {
        if (rectanglePath != null) {
            add(rectanglePath);
            rectanglePath = null;
        }
    }

    void add(Rectangle2D rect) {
        if (rectangle == null) {
            rectangle = new Rectangle2D.Double();
            rectangle.setRect(rect);
        } else {
            rectangle.add(rect);
        }
    }

    void add(Point2D... points) {
        for (Point2D point : points) {
            add(point.getX(), point.getY());
        }
    }

    void add(double newx, double newy) {
        if (rectangle == null) {
            rectangle = new Rectangle2D.Double(newx, newy, 0, 0);
        } else {
            rectangle.add(newx, newy);
        }
    }

    Rectangle2D rectanglePath = null;
    Rectangle2D rectangle = null;
}

(BoundingBoxFinder on github)

As you can see I borrowed the calculateGlyphBounds helper method from a PDFBox example class.

A usage example

You can use the BoundingBoxFinder like this to draw a border line along the bounding box rim for a given PDPage pdPage of a PDDocument pdDocument:

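// Needs java.awt.Color, java.awt.geom.Rectangle2D, java.io.IOException and, from
// org.apache.pdfbox.pdmodel, PDDocument, PDPage, PDPageContentStream and
// PDPageContentStream.AppendMode.
void drawBoundingBox(PDDocument pdDocument, PDPage pdPage) throws IOException {
    // Scan the page content and collect the overall bounding box.
    BoundingBoxFinder boxFinder = new BoundingBoxFinder(pdPage);
    boxFinder.processPage(pdPage);
    Rectangle2D box = boxFinder.getBoundingBox();
    // The box is null if nothing was drawn on the page.
    if (box != null) {
        // Append a new content stream that strokes the box outline in magenta.
        try (PDPageContentStream canvas = new PDPageContentStream(pdDocument, pdPage, AppendMode.APPEND, true, true)) {
            canvas.setStrokingColor(Color.magenta);
            canvas.addRect((float)box.getMinX(), (float)box.getMinY(), (float)box.getWidth(), (float)box.getHeight());
            canvas.stroke();
        }
    }
}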

(DetermineBoundingBox helper method)
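For the original goal of trimming whitespace before printing, the box can also be turned into a crop rectangle rather than drawn. The sketch below is not part of the code above; it only shows the arithmetic with the JDK's `Rectangle2D`. The class name `CropBoxMath`, the method `paddedCrop` and the margin parameter are hypothetical, and it assumes the page rectangle is given in the same coordinate system as the bounding box.

```java
import java.awt.geom.Rectangle2D;

final class CropBoxMath {
    /**
     * Grows the content box by the given margin on every side and
     * clamps the result to the page rectangle.
     */
    static Rectangle2D paddedCrop(Rectangle2D content, Rectangle2D page, double margin) {
        Rectangle2D padded = new Rectangle2D.Double(
                content.getX() - margin,
                content.getY() - margin,
                content.getWidth() + 2 * margin,
                content.getHeight() + 2 * margin);
        // Never let the crop rectangle extend beyond the page itself.
        return padded.createIntersection(page);
    }

    public static void main(String[] args) {
        Rectangle2D content = new Rectangle2D.Double(72, 480, 200, 232); // made-up content box
        Rectangle2D page = new Rectangle2D.Double(0, 0, 595, 842);       // roughly A4 in points
        System.out.println(paddedCrop(content, page, 10));
    }
}
```

With PDFBox, the resulting values could then be passed to `pdPage.setCropBox(new PDRectangle(x, y, width, height))` after casting them to `float`.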

The result looks like this:

Screenshot

Only a proof-of-concept

Beware, the BoundingBoxFinder really is not very sophisticated. In particular, it does not ignore invisible content such as a white background rectangle, text drawn in rendering mode "invisible", arbitrary content covered by a white filled path, white parts of bitmap images, ... Furthermore, it does not take clip paths, shading fills, weird blend modes, annotations, ... into account.

Extending the class to properly handle those cases is fairly straightforward, but the sum of the code to add would exceed the scope of a Stack Overflow answer.
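To give an idea of what one such extension might look like, here is a rough sketch of clip handling, using only the JDK. It is an assumption-laden simplification, not PDFBox code: the class `ClippedBoxAccumulator` is hypothetical, the clip path is approximated by its bounding rectangle, and a real implementation would also have to save and restore the clip together with the graphics state.

```java
import java.awt.geom.Rectangle2D;

final class ClippedBoxAccumulator {
    private Rectangle2D clip;   // null means no clip is active
    private Rectangle2D total;  // null until something visible was added

    /** Sets the current clip, approximated by the bounds of the clip path. */
    void setClip(Rectangle2D clipBounds) {
        clip = clipBounds;
    }

    /** Adds the visible part of a content box to the overall bounding box. */
    void add(Rectangle2D box) {
        Rectangle2D visible = box;
        if (clip != null) {
            visible = clip.createIntersection(box);
            // A negative size means the box lies entirely outside the clip.
            if (visible.getWidth() < 0 || visible.getHeight() < 0) {
                return;
            }
        }
        if (total == null) {
            total = new Rectangle2D.Double();
            total.setRect(visible);
        } else {
            total.add(visible);
        }
    }

    Rectangle2D getBoundingBox() {
        return total;
    }
}
```

The size check deliberately accepts zero-width or zero-height results, because the bounds of a horizontal or vertical line are degenerate rectangles that `Rectangle2D.intersects` would reject.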


For the code in this answer I used the current PDFBox 3.0.0-SNAPSHOT development branch but it should also work out of the box for current 2.x versions.

mkl
  • Thank you very much, this looks to be exactly what I needed. – xs0 Oct 26 '18 at 15:12
  • I used `calculateGlyphBounds` for individual characters. It looks great on the PDF, but when rendered to an image (300 DPI), the Y coordinates of the bbox are drawn with an offset (too high), and the bottom edges of characters are cut off in some cases. How can this be solved? – Orit Oct 26 '21 at 09:53
  • @Orit As you can see in the comment of that method, I simply copied that method from the PDFBox example `DrawPrintTextLocations`. So you may want to test that tool first. If that also does not return the desired output, consider creating a question in its own right with more context and example files. – mkl Oct 26 '21 at 10:24