
I have an issue with PDFBox 2.0.1: it is unable to render a PDF. I wouldn't really care if PDFBox failed on a couple of files, but the rendering thread hangs for several minutes without returning while memory use keeps growing, with no end in sight.

The problem seems to be with renderImageWithDPI. This is how I call it:

PDFRenderer renderer = new PDFRenderer(document);
BufferedImage image = renderer.renderImageWithDPI(0, 96); //Gets stuck here
ImageIO.write(image, "PNG", new File(fileName));

The code gets stuck on that line and consumes CPU and memory. In NetBeans I see this stack trace whenever I pause execution. I am not sure what is happening: PDFBox appears to be working but seems to have hit some sort of infinite loop.

[Screenshot of the stack trace; image not preserved]

The PDF in question can be downloaded from: https://drive.google.com/file/d/0B5zMlyl8rHwsY3Y1WjFVZlllajA/view?usp=sharing

Can someone help, please?

Zaid Amir

2 Answers


Are you on Java 8 or Java 9? As explained here, start Java with this option:

-Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider 

This is related to JDK 8/9 having changed its color management system.
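If the JVM start options cannot be changed, the same property can in principle be set from code. The following is a minimal sketch, not part of PDFBox: the class name CmmSetup and the method preferKcms are made up for illustration, and it assumes the call runs before any Java 2D color management class has been loaded, since the property is presumably read only once.

```java
// Sketch only: sets the same property as the -D option above, from code.
// Assumption: this runs before any Java 2D color management class is loaded.
final class CmmSetup {
    static final String CMM_PROPERTY = "sun.java2d.cmm";
    static final String KCMS_PROVIDER = "sun.java2d.cmm.kcms.KcmsServiceProvider";

    private CmmSetup() {}

    /**
     * Selects KCMS unless a provider was already chosen (for example via -D),
     * and returns the value that is in effect afterwards.
     */
    static String preferKcms() {
        if (System.getProperty(CMM_PROPERTY) == null) {
            System.setProperty(CMM_PROPERTY, KCMS_PROVIDER);
        }
        return System.getProperty(CMM_PROPERTY);
    }
}
```

Calling CmmSetup.preferKcms() as the first statement of main, before creating the PDFRenderer, would be the natural place for it. An explicit -D option on the command line still wins, because the sketch never overwrites an existing value.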

The file is still slow to render (20-30 seconds), because it is very complex.

(Btw rendering didn't hang. It just took very, very long, i.e. several minutes)

New since PDFBox 2.0.9: you mentioned you're creating thumbnails. You can now enable subsampling with PDFRenderer.setSubsamplingAllowed(true), which reduces the memory used for images.

Tilman Hausherr
  • Thanks for the reply, this worked brilliantly. I still have some reservations about using KCMS: the code is part of a big service that creates thumbnails of various file types, not just PDFs. Are there any known issues with switching everything to KCMS? I tried a couple of files and it all seems to work, though I am concerned about the long run – Zaid Amir Jun 09 '16 at 19:05
  • @ZaidAmir I haven't heard of any problems. KCMS is what was used until JDK7. It's a known problem, it's not just us and Oracle just shows the middle finger. https://bugs.openjdk.java.net/browse/JDK-8041125 https://blog.idrsolutions.com/2014/04/color-performance-change-newer-java-releases/ (that's from our competition) – Tilman Hausherr Jun 09 '16 at 19:19

The issue can be reproduced in a Java 8 VM. As @Tilman already mentioned in his answer, it is an issue introduced by Java 8 using a different color management system than earlier Java versions.

Analyzing the VM behavior with the new color management system it becomes clear that the issue is not really a memory leak issue (as could be conjectured due to the excessive memory use); instead objects are instantiated faster than garbage collection can collect and free unused objects!
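One way to check this kind of claim oneself is to compare the heap in use before and after a requested collection: if most of it can be reclaimed, the memory was held by short-lived garbage rather than leaked. The following is a rough sketch using only java.lang.Runtime; the class name HeapProbe and its methods are hypothetical, and System.gc() is merely a hint that the JVM is free to ignore.

```java
// Sketch only: coarse heap measurements via java.lang.Runtime.
final class HeapProbe {
    private HeapProbe() {}

    /** Heap bytes currently in use (allocated minus free), as reported by the JVM. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Requests a garbage collection and returns the heap bytes in use afterwards.
     * The request is only a hint, so the result is an approximation.
     */
    static long usedHeapAfterGc() {
        System.gc();
        return usedHeapBytes();
    }
}
```

Logging usedHeapBytes() and then usedHeapAfterGc() from a monitoring thread while the page renders should show the difference: a large drop after the collection points at allocation churn, a small one at objects that are still referenced.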

One can allow garbage collection to catch up by changing the main loop of page content parsing in PDFStreamEngine.processStreamOperators(PDContentStream):

int i = 1;                         // new
while (token != null)
{
    if (token instanceof COSObject)
    {
        arguments.add(((COSObject) token).getObject());
    }
    else if (token instanceof Operator)
    {
        processOperator((Operator) token, arguments);
        arguments = new ArrayList<COSBase>();
    }
    else
    {
        arguments.add((COSBase) token);
    }
    token = parser.parseNextToken();
    if (i++ % 1000 == 0)           // new
        Runtime.getRuntime().gc(); // new
}

(1000 being an arbitrary value I chose out of thin air.)

This is still slow, but it eventually creates the bitmap without excessive memory usage.

Thus, it looks like the older color management system instantiated far fewer temporary objects and/or explicitly allowed garbage collection to step in.


PS: The change above does not speed things up. It merely prevents the excessive memory use the OP observed and which in my test setup resulted in an OutOfMemory situation.

If the OP has full control over the deployment environment, he should indeed use the option @Tilman showed in his answer. If he does not, though, e.g. if he eventually deploys onto a web server he does not administer and the administrators do not want to add to the JVM start options, he can at least prevent the excessive memory use.

mkl
  • https://stackoverflow.com/questions/2414105/why-is-it-bad-practice-to-call-system-gc – Tilman Hausherr Jun 09 '16 at 16:31
  • @Tilman that question is nice for generic statements on that issue but as always generic answers tend to be ignorable given the right circumstances. – mkl Jun 09 '16 at 19:06
  • I tried your change, and it is still slow. With the setting, it takes 20-30 seconds. Without the setting but with your change, it takes 1000 seconds in PDFDebugger (72 dpi). – Tilman Hausherr Jun 10 '16 at 07:21
  • @TilmanHausherr Yes, of course it's slow (my change does not speed things up), but it no longer runs into an OutOfMemory situation, which it did before the change in my test setup. If the OP has full control over the deployment environment, he should indeed use the option you showed in your answer. If he doesn't, though, e.g. if he eventually deploys onto a web server he does not administer and the administrators do not want to add to the JVM start options, he can at least prevent the excessive memory use one has without that change. I'll add a remark to this effect to the answer. – mkl Jun 10 '16 at 08:59