
I was trying to render an image and got an out of memory error in the code below.

    try {
        // Render one page (0-based index) of the PDF at 300 DPI as grayscale
        BufferedImage image = pdfRenderer.renderImageWithDPI(page - 1, 300, ImageType.GRAY);
        ImageIOUtil.writeImage(image, "G:/Trial/tempImg.png", 300);

        int bpp = image.getColorModel().getPixelSize();               // bits per pixel
        int bytespp = bpp / 8;                                        // bytes per pixel
        int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);  // bytes per line
        int height = image.getHeight();
        int width = image.getWidth();

        // Convert the image to raw bytes and hand it to Tesseract (the error occurs here)
        TessAPI1.TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), width, height, bytespp, bytespl);
        TessAPI1.TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO);
        // codes
    } finally {
        // some code so that this function could be called again with next pdf
        // some code to release resources
    }

In this code segment I first render a particular page of a PDF document to a BufferedImage, then convert the BufferedImage to bytes before passing it to Tesseract. The out of memory error occurs at that conversion.

Normally an out of memory error comes with an extra message saying which area is exhausted, such as "Java heap space" or "PermGen space". Here I get only a bare out of memory error. Please explain this.

While debugging, I observed that the program doesn't terminate at the line where I convert the image to bytes. Instead, control goes to the finally block (this code segment is wrapped in try and finally). So I put a continue in my finally block, and voila, my code ran perfectly for the next set of PDFs.
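
The behaviour described above can be reproduced without PDFBox or Tesseract. The sketch below is only an illustration, not the original program: the class name `FinallySwallowDemo`, the method `processAll`, and the hand-thrown `OutOfMemoryError` are all hypothetical stand-ins for the real rendering loop. It shows that a `continue` inside `finally` completes the block abruptly, which makes Java discard the pending `Throwable`, so the loop simply moves on to the next item.

```java
import java.util.ArrayList;
import java.util.List;

public class FinallySwallowDemo {

    // Simulates processing `count` documents. Document 1 fails with an
    // OutOfMemoryError thrown by hand (a stand-in for the real render step).
    static List<String> processAll(int count) {
        List<String> log = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            try {
                if (i == 1) {
                    throw new OutOfMemoryError("simulated");
                }
                log.add("processed " + i);
            } finally {
                log.add("cleanup " + i);
                // Leaving finally via continue discards the pending error,
                // so the caller never sees it and the loop keeps going.
                continue;
            }
        }
        return log;
    }

    public static void main(String[] args) {
        processAll(3).forEach(System.out::println);
    }
}
```

Running `main` lists "processed" and "cleanup" entries for documents 0 and 2, but only a "cleanup" entry for document 1, and no error reaches the caller. As for memory, a likely explanation (an assumption, since no stack trace was posted) is that the failed request was for one very large array: a failed allocation consumes nothing, and once the iteration ends the large image is unreachable and can be garbage collected, leaving room for the next PDF.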

So my question is: how does my program keep running after the out of memory error (not that I want it to stop)? If memory is really full, how can the code load the next set of PDFs? Any insight on this would be really wonderful. Thanks.

P.S. This problem is worked around by that hack and my code runs, but I am curious why all this is happening.

ANKIT
  • Please post the stack trace. – Jos Jun 25 '16 at 08:05
  • I will post the stack trace as soon as possible. Currently my PC is not free and I can't run two instances of a program that uses Tesseract. – ANKIT Jun 25 '16 at 09:59
  • Coincidentally, I have observed the same with an application that runs PDFBox preflight on 250000 PDF files. It calls `e.printStackTrace(PrintWriter);` to a file in a `catch (Throwable e)` segment, and sometimes only the first line comes out. My thought was that the memory management was so messed up at that time that it didn't have the resources to print the stack trace. – Tilman Hausherr Jun 25 '16 at 12:36
  • @TilmanHausherr, I think you are correct. This code is part of my project, where I run OCR on many PDF files. Since some images are large, I was getting the above error, which I bypassed using finally. But after running the code for 7-8 hours I found that it was skipping every PDF (past a certain point). I can only conclude that this is a memory problem: the JVM doesn't have enough resources to render any more images. I think there is a memory leak in my program. Can you confirm that the PDFBox image classes don't contribute to this leak? I can't find a leak in my code. – ANKIT Jun 25 '16 at 16:06
  • Who knows... I can't find a memory leak unless I get a specific scenario. And even then, it is not always possible to reproduce it: https://issues.apache.org/jira/browse/PDFBOX-3334 – Tilman Hausherr Jun 25 '16 at 16:11
  • I think there is some leak in the PDF renderer or image util. My code was working fine before for a large number of PDFs. But since I started rendering pages, my heap is continuously increasing. :( Can't seem to resolve this. – ANKIT Jun 25 '16 at 16:17
  • Yes, there are also some TTF objects in my heap, but they take only around 2 MB, so I didn't mention that. – ANKIT Jun 25 '16 at 16:22
  • The memory leak could also be in ImageIO: https://stackoverflow.com/questions/8279252/how-to-decode-jpx-images-in-java – Tilman Hausherr Jun 25 '16 at 16:38
  • @ANKIT did you solve this? I am currently stuck at this :/ Some help please. :) – Pramesh Bajracharya Aug 23 '19 at 13:06
  • @PrameshBajracharya I believe I had to assume there was a memory leak in the PDF library. I wasn't able to do much because of that. But as I mentioned, my code was working because I put continue inside the finally block. – ANKIT Sep 10 '19 at 14:06
  • @ANKIT yeah, true, it turns out `PDFRenderer` was the culprit for the out of memory error here. The good news is that it only happens in older versions. For future readers: you might want to upgrade to a newer version of `PDFBOX`, and if the issue still persists, consider batch processing your PDFs. What I did was divide the PDFs into batches, and if a PDF had more than 30 pages I split it into groups of 30. This way I did not get the OOM error and was able to run through 10 GB+ of PDFs. Hope this helps someone :) – Pramesh Bajracharya Sep 10 '19 at 14:55
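
The last comment does not say how the pages were grouped, so the following is only a sketch of the arithmetic under that assumption. The class `PageBatches` and the method `batches` are hypothetical names, and the code makes no PDFBox calls; it just computes 0-based page ranges of at most 30 pages that a rendering loop could work through one group at a time.

```java
import java.util.ArrayList;
import java.util.List;

public class PageBatches {

    // Splits pageCount pages into consecutive ranges of at most batchSize pages.
    // Each range is {startInclusive, endExclusive}, using 0-based page indices.
    static List<int[]> batches(int pageCount, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 0; start < pageCount; start += batchSize) {
            int end = Math.min(start + batchSize, pageCount);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 65-page document split into groups of 30
        for (int[] r : batches(65, 30)) {
            System.out.println("pages " + r[0] + " to " + (r[1] - 1));
        }
    }
}
```

Each range could then be handled with its own freshly opened document and renderer, closed before the next range starts, so that whatever one group retains can be released before the next begins. Whether that is what the commenter did is not stated.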

0 Answers