Why does PDFBox return image dimension of size 0 x 0

Question

To find the actual size an image occupies in a PDF, I use PDFBox, following what is described in this SO answer. Basically, I call

 // Computes the actual location and dimensions of the image
 PrintImageLocations renderer = new PrintImageLocations();

 for (int i = 0; i < pageLimit; ++i) {
     PDPage page = pdf.getPage(i);
     renderer.processPage(page);
 }

and the PrintImageLocations() is taken from this PDFBox code example.

Yet with a PDF document that I use for testing (generated by GPL Ghostscript 910 (ps2write) from an image found on Wikipedia), the image size reported is 0 x 0, although the PDF can be imported into GIMP or LibreOffice Draw.

So I'd like to know whether the code I am currently using is a reliable way to find the image size, and what could prevent it from finding the right size.

The PDF used for this test can be found here

==========

Edit: Following @Itai's comment, it appears that the body of the condition if ("Do".equals(operation)) is never executed because no "Do" operation is invoked. Consequently, every operator is simply passed on to processOperator of the superclass.

The only operations that are invoked are the following (I added System.err.println("Processing " + operation); before the condition in the overridden processOperator method):

 Processing q
 Processing cm
 Processing gs
 Processing q
 Processing re
 Processing W
 Processing n
 Processing rg
 Processing re
 Processing f
 Processing cs
 Processing scn
 Processing re
 Processing f
 Processing Q
 Processing Q

==========

Any hints appreciated,

@Itai it reports the actual size on other PDFs. Even testing http://www.hourofthetime.com/1-LF/December2012/Hour_Of_The_Time_12282012-Northwoods_Justification_For_US_Military_Intervention.pdf which looks alike (as far as the first page is concerned) shows the right size. — HelloWorld, May 29 '18 at 08:10
I meant - are all sizes *for that picture* reported as 0? It should print the pixel size, unit size, actual size etc. Are all of them 0? — Itai, May 29 '18 at 08:11
Good guess @Itai! Indeed the body of `if (xobject instanceof PDImageXObject) {` is never reached. And zero is the initialized value that I use. — HelloWorld, May 29 '18 at 08:15
There are no "Do" operations, so the operations get processed in the super class. Here are the operations that are processed : Processing q Processing cm Processing gs Processing q Processing re Processing W Processing n Processing rg Processing re Processing f Processing cs Processing scn Processing re Processing f Processing Q Processing Q — HelloWorld, May 29 '18 at 08:28
The image is neither immediately added from the page content nor from an embedded XObject. Instead it is used inside a pattern. The example code, though, only inspects the content of the page and embedded (probably nested) XObjects. Thus, your image is not found. This, by the way, also is the reason why you can't easily export the image from Adobe Reader... — mkl, May 29 '18 at 09:32

score 1 · Accepted Answer · answered May 29 '18 at 15:41

As you already have found out yourself, the reason for the 0x0 output is that the code from PrintImageLocations as-is cannot find the image at all.

PrintImageLocations does not find the image because it only looks for image uses in the page content and in form XObjects (including nested ones) used in the page content. In the file at hand, however, the image is drawn inside the content stream of a tiling pattern, and that pattern is used to fill an area in the page content.
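The operator trace from the question fits this explanation: there is no `Do`, but the second `f` runs after `cs` and `scn`, and `scn` is the operator that can select a pattern as fill color. As a rough illustration in plain Java (no PDFBox involved; the class `OperatorTrace` and its method are made up for this sketch), one can replay the logged operator names and flag the fills that might use a pattern. Operator names alone cannot prove that a pattern is in use, because `scn` also sets colors in other color spaces, so treat the result as a list of candidates only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy replay of a content stream trace; hypothetical helper, not part of PDFBox. */
final class OperatorTrace {
    private static final List<String> FILL_OPS =
            Arrays.asList("f", "F", "f*", "b", "b*", "B", "B*");
    // Non-stroking operators that reset the fill color to something that is not a pattern.
    private static final List<String> NON_PATTERN_COLOR_OPS =
            Arrays.asList("g", "rg", "k", "cs", "sc");

    /** Returns the 0-based positions of fill operators whose fill color was last set by scn. */
    static List<Integer> candidatePatternFills(List<String> ops) {
        Deque<Boolean> savedStates = new ArrayDeque<>(); // mimics q / Q
        boolean maybePattern = false;
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < ops.size(); i++) {
            String op = ops.get(i);
            if (op.equals("q")) {
                savedStates.push(maybePattern);
            } else if (op.equals("Q")) {
                if (!savedStates.isEmpty()) {
                    maybePattern = savedStates.pop();
                }
            } else if (op.equals("scn")) {
                maybePattern = true;
            } else if (NON_PATTERN_COLOR_OPS.contains(op)) {
                maybePattern = false;
            } else if (FILL_OPS.contains(op) && maybePattern) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> ops = Arrays.asList(
                "q cm gs q re W n rg re f cs scn re f Q Q".split(" "));
        System.out.println("contains Do: " + ops.contains("Do"));            // contains Do: false
        System.out.println("candidate fills: " + candidatePatternFills(ops)); // candidate fills: [13]
    }
}
```

Position 13 is the second `f`; the first `f` at position 9 follows `rg` and is therefore a plain RGB fill.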

To allow PDFBox to find this image, we have to extend the PrintImageLocations class a bit so that it also descends into pattern content streams, e.g. like this:

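// Package names as in PDFBox 2.0.x; adjust the PrintImageLocations import
// if you copied that example class into your own package.
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.color.*;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.examples.util.PrintImageLocations;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.pattern.PDAbstractPattern;
import org.apache.pdfbox.pdmodel.graphics.pattern.PDTilingPattern;

class PrintImageLocationsImproved extends PrintImageLocations {
    // Operators that fill a path (some of them also stroke it)
    final List<String> fillOperations = Arrays.asList("f", "F", "f*", "b", "b*", "B", "B*");

    public PrintImageLocationsImproved() throws IOException {
        super();

        // Keep the non-stroking (fill) color in the graphics state up to date
        addOperator(new SetNonStrokingColor());
        addOperator(new SetNonStrokingColorN());
        addOperator(new SetNonStrokingDeviceCMYKColor());
        addOperator(new SetNonStrokingDeviceGrayColor());
        addOperator(new SetNonStrokingDeviceRGBColor());
        addOperator(new SetNonStrokingColorSpace());
    }

    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        String operation = operator.getName();
        if (fillOperations.contains(operation)) {
            // If the current fill color is a tiling pattern, analyse its content stream, too
            PDColor color = getGraphicsState().getNonStrokingColor();
            PDAbstractPattern pattern = getResources().getPattern(color.getPatternName());
            if (pattern instanceof PDTilingPattern) {
                processTilingPattern((PDTilingPattern) pattern, null, null);
            }
        }
        super.processOperator(operator, operands);
    }
}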

(ExtractImageLocations inner class PrintImageLocationsImproved)

The tiling pattern in the document at hand is used as a pattern color for filling, not stroking. Thus, PrintImageLocationsImproved has to register operator listeners for non-stroking color operators to have the fill color correctly updated in the graphics state.

Before delegating to the PrintImageLocations implementation, processOperator now first checks whether the operator is a fill operation. If so, it inspects the current fill color. If that color is a tiling pattern, processOperator calls processTilingPattern, which is defined in PDFStreamEngine and starts a nested analysis of the pattern content stream. This eventually lets PrintImageLocationsImproved find the image.

Using PrintImageLocationsImproved like this

try (   PDDocument document = PDDocument.load(...)    )
{
    PrintImageLocations printer = new PrintImageLocationsImproved();
    int pageNum = 0;
    for( PDPage page : document.getPages() )
    {
        pageNum++;
        System.out.println( "Processing page: " + pageNum );
        printer.processPage(page);
    }
}

(ExtractImageLocations test testExtractLikeHelloWorldImprovedFromTopSecret)

for your PDF file, therefore, will find the image:

Processing page: 1
*******************************************************************
Found image [R8]
position in PDF = 39.0, 102.48 in user space units
raw image size  = 1209, 1640 in pixels
displayed size  = 516.3119, 700.3752 in user space units
displayed size  = 7.1709986, 9.727433 in inches at 72 dpi rendering
displayed size  = 182.14336, 247.0768 in millimeters at 72 dpi rendering
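The inch and millimeter figures in this output follow directly from the size in user space units. Here is a minimal sketch of that arithmetic, assuming the default user space of 72 units per inch (which is what "at 72 dpi rendering" refers to) and no UserUnit entry on the page. The class `DisplayedSize` is only an illustration and not part of PDFBox.

```java
import java.util.Locale;

/** Converts sizes in PDF user space units to inches and millimeters (illustrative helper). */
final class DisplayedSize {
    // Assumption: default user space, i.e. 72 units per inch.
    static final double UNITS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    static double toInches(double userSpaceUnits) {
        return userSpaceUnits / UNITS_PER_INCH;
    }

    static double toMillimeters(double userSpaceUnits) {
        return toInches(userSpaceUnits) * MM_PER_INCH;
    }

    public static void main(String[] args) {
        // Displayed size reported for image R8 above
        double width = 516.3119;
        double height = 700.3752;
        System.out.println(String.format(Locale.ROOT, "%.4f x %.4f in",
                toInches(width), toInches(height)));           // 7.1710 x 9.7274 in
        System.out.println(String.format(Locale.ROOT, "%.4f x %.4f mm",
                toMillimeters(width), toMillimeters(height))); // 182.1434 x 247.0768 mm
    }
}
```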

Beware,

this is not a perfect fix but rather a proof of concept and work-around: it neither properly restricts the pattern to the area actually filled nor returns multiple finds for an area large enough to require multiple pattern tiles to fill. Nonetheless, it returns an image match for the file at hand.
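To get a feeling for the second limitation, here is a rough plain-Java estimate of how many pattern cells overlap a filled rectangle. It rests on several simplifying assumptions: the rectangle is given in pattern space, the pattern matrix is ignored, each tile exactly covers its cell (the BBox equals XStep x YStep), and cells start at the pattern space origin. The class `TileEstimate` and the numbers in `main` are purely hypothetical and not taken from the file at hand.

```java
/** Rough count of tiling pattern cells overlapping a rectangle (illustrative only). */
final class TileEstimate {
    /** Number of cells of size |step| that overlap the interval [from, to), assuming to > from. */
    static long cellsAlongAxis(double from, double to, double step) {
        double s = Math.abs(step); // PDF allows negative steps; only the magnitude matters here
        return (long) (Math.ceil(to / s) - Math.floor(from / s));
    }

    /** Number of cells overlapping the rectangle (x, y, width, height). */
    static long tiles(double x, double y, double width, double height,
                      double xStep, double yStep) {
        return cellsAlongAxis(x, x + width, xStep) * cellsAlongAxis(y, y + height, yStep);
    }

    public static void main(String[] args) {
        // Hypothetical values: a 250 x 100 area filled with 100 x 100 cells
        System.out.println(tiles(0, 0, 250, 100, 100, 100)); // 3
    }
}
```

A complete solution would report the image once per overlapping tile, each clipped to the filled area, whereas the work-around above reports it once.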

Thanks a lot @mkl, I will give it a try! By the way, how do you inspect such PDFs in order to reveal how they are laid out / organized? — HelloWorld, May 29 '18 at 16:09
There are tools to inspect PDFs like PDFBox PDFDebugger and iText RUPS. And some details already can be seen in a simple text viewer. You need some background knowledge in the PDF format, though. — mkl, May 29 '18 at 16:44
Hey thank you @mkl, I didn't know PDFBox existed as standalone java app (https://pdfbox.apache.org/1.8/commandline.html)! — HelloWorld, May 30 '18 at 08:09
