As you have already found out yourself, the reason for the 0x0 output is that the code from PrintImageLocations as-is cannot find the image at all.
PrintImageLocations does not find the image because it only looks for image uses in the page content and in form XObjects (also nested ones) used in the page content. In the file at hand, however, the image is drawn inside the content stream of a tiling pattern which is used to fill an area in the page content.
To allow PDFBox to find this image, we have to extend the PrintImageLocations class a bit to also descend into pattern content streams, e.g. like this:
    class PrintImageLocationsImproved extends PrintImageLocations {
        public PrintImageLocationsImproved() throws IOException {
            super();

            // Track the non-stroking (fill) color in the graphics state
            addOperator(new SetNonStrokingColor());
            addOperator(new SetNonStrokingColorN());
            addOperator(new SetNonStrokingDeviceCMYKColor());
            addOperator(new SetNonStrokingDeviceGrayColor());
            addOperator(new SetNonStrokingDeviceRGBColor());
            addOperator(new SetNonStrokingColorSpace());
        }

        @Override
        protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
            String operation = operator.getName();

            if (fillOperations.contains(operation)) {
                // On a fill, check whether the current fill color is a tiling pattern
                PDColor color = getGraphicsState().getNonStrokingColor();
                PDAbstractPattern pattern = getResources().getPattern(color.getPatternName());
                if (pattern instanceof PDTilingPattern) {
                    // Analyze the pattern content stream as a nested stream
                    processTilingPattern((PDTilingPattern) pattern, null, null);
                }
            }

            super.processOperator(operator, operands);
        }

        // Path painting operators that fill (some of them also stroke)
        final List<String> fillOperations = Arrays.asList("f", "F", "f*", "b", "b*", "B", "B*");
    }
(ExtractImageLocations inner class PrintImageLocationsImproved)
The tiling pattern in the document at hand is used as a pattern color for filling, not stroking. Thus, PrintImageLocationsImproved has to register operator listeners for the non-stroking color operators so that the fill color is correctly updated in the graphics state.
Before delegating to the PrintImageLocations implementation, processOperator now first checks whether the operator is a fill operation. In that case it inspects the current fill color. If that color refers to a tiling pattern, processOperator initiates the processTilingPattern handling defined in PDFStreamEngine, which starts a nested analysis of the pattern content stream and so eventually lets PrintImageLocationsImproved find the image.
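The difference between the two traversals can be pictured with a small toy model. This is only a sketch under simplifying assumptions: none of the types below are PDFBox classes, `Op`, `Kind`, `findImages` and `samplePage` are hypothetical names made up for illustration, and a real content stream is of course processed operator by operator with a graphics state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these types are PDFBox classes.
class PatternDescentSketch {

    enum Kind { DRAW_IMAGE, DRAW_FORM, FILL_WITH_PATTERN }

    // One heavily simplified content stream instruction.
    static final class Op {
        final Kind kind;
        final String imageName;  // set for DRAW_IMAGE
        final List<Op> nested;   // set for DRAW_FORM and FILL_WITH_PATTERN

        Op(Kind kind, String imageName, List<Op> nested) {
            this.kind = kind;
            this.imageName = imageName;
            this.nested = nested;
        }
    }

    // Collects image names; descends into pattern content only if asked to.
    static List<String> findImages(List<Op> stream, boolean descendIntoPatterns) {
        List<String> found = new ArrayList<>();
        for (Op op : stream) {
            switch (op.kind) {
                case DRAW_IMAGE:
                    found.add(op.imageName);
                    break;
                case DRAW_FORM:
                    // Form XObjects are always followed, also when nested
                    found.addAll(findImages(op.nested, descendIntoPatterns));
                    break;
                case FILL_WITH_PATTERN:
                    if (descendIntoPatterns) {
                        found.addAll(findImages(op.nested, true));
                    }
                    break;
            }
        }
        return found;
    }

    // Page content that fills an area with a pattern whose tile draws image "R8".
    static List<Op> samplePage() {
        Op image = new Op(Kind.DRAW_IMAGE, "R8", null);
        Op fill = new Op(Kind.FILL_WITH_PATTERN, null, Arrays.asList(image));
        return Arrays.asList(fill);
    }

    public static void main(String[] args) {
        System.out.println(findImages(samplePage(), false)); // prints []
        System.out.println(findImages(samplePage(), true));  // prints [R8]
    }
}
```

With `descendIntoPatterns` set to `false` the walker behaves like the original class in this model and reports nothing; with `true` it also follows the pattern content and reports the image.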
Using PrintImageLocationsImproved like this
    try ( PDDocument document = PDDocument.load(...) )
    {
        PrintImageLocations printer = new PrintImageLocationsImproved();
        int pageNum = 0;
        for( PDPage page : document.getPages() )
        {
            pageNum++;
            System.out.println( "Processing page: " + pageNum );
            printer.processPage(page);
        }
    }
(ExtractImageLocations test testExtractLikeHelloWorldImprovedFromTopSecret)
for your PDF file therefore finds the image:
Processing page: 1
*******************************************************************
Found image [R8]
position in PDF = 39.0, 102.48 in user space units
raw image size = 1209, 1640 in pixels
displayed size = 516.3119, 700.3752 in user space units
displayed size = 7.1709986, 9.727433 in inches at 72 dpi rendering
displayed size = 182.14336, 247.0768 in millimeters at 72 dpi rendering
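The three "displayed size" lines are the same measurement in different units. As a rough sketch, assuming the default user space unit of 1/72 inch (i.e. no UserUnit entry on the page), the conversion can be reproduced like this. `DisplayedSize` and its methods are hypothetical helper names, not part of PDFBox:

```java
// Hypothetical helper, not part of PDFBox: reproduces the unit conversions
// behind the "displayed size" lines of the output.
class DisplayedSize {
    static final float UNITS_PER_INCH = 72f;  // assumption: default user space, 1 unit = 1/72 inch
    static final float MM_PER_INCH = 25.4f;

    static float toInches(float userSpaceUnits) {
        return userSpaceUnits / UNITS_PER_INCH;
    }

    static float toMillimeters(float userSpaceUnits) {
        return toInches(userSpaceUnits) * MM_PER_INCH;
    }

    public static void main(String[] args) {
        float width = 516.3119f;   // displayed width in user space units, from the output
        float height = 700.3752f;  // displayed height in user space units, from the output
        System.out.println(toInches(width) + ", " + toInches(height) + " in inches");
        System.out.println(toMillimeters(width) + ", " + toMillimeters(height) + " in millimeters");
    }
}
```

Up to float rounding, this gives about 7.171 by 9.727 inches and about 182.14 by 247.08 millimeters, matching the output.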
Beware, this is not a perfect fix but rather a proof of concept and work-around: it neither properly restricts the pattern to the area actually filled nor returns multiple finds for an area large enough to require multiple pattern tiles to fill. Nonetheless, it returns an image match for the file at hand.
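To give an idea of what a more complete solution would have to account for, here is a rough sketch that counts how many tiles of a regular grid overlap a filled rectangle. It rests on strong simplifying assumptions: all coordinates are already in pattern space, the tile grid is anchored at the origin with positive step sizes, each tile fills its whole grid cell, and the filled area is an axis-parallel rectangle. `TileCountSketch` and the numbers used are made up for illustration and are not taken from the file at hand.

```java
// Hypothetical helper, not part of PDFBox.
class TileCountSketch {

    // Number of grid cells of width `step` overlapping the interval [from, to),
    // with the grid anchored at 0. Assumes step > 0 and from < to.
    static int cellsOverlapping(double from, double to, double step) {
        int first = (int) Math.floor(from / step);
        int lastExclusive = (int) Math.ceil(to / step);
        return lastExclusive - first;
    }

    // Number of tiles overlapping the rectangle (x0, y0)-(x1, y1).
    static int tilesOverlapping(double x0, double y0, double x1, double y1,
                                double xStep, double yStep) {
        return cellsOverlapping(x0, x1, xStep) * cellsOverlapping(y0, y1, yStep);
    }

    public static void main(String[] args) {
        // Illustrative numbers: 100 x 100 tile grid,
        // filled rectangle from (50, 50) to (250, 120)
        System.out.println(tilesOverlapping(50, 50, 250, 120, 100, 100)); // prints 6
    }
}
```

A fuller fix would report one find per overlapping tile, each with its own position, and would additionally have to apply the pattern matrix and clip to the actual fill path.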