I'm trying to perform OCR on PDFs. There are two steps in the code:
- Convert the PDF to TIFF files
- Convert the TIFF to text
I used ghost4j for the first step and tess4j for the second. Everything worked fine until I started running it multi-threaded, and then strange exceptions occurred. I read here: https://sourceforge.net/p/tess4j/discussion/1202293/thread/44cc65c5/ that ghost4j is not suitable for multi-threaded use, so I changed the first step to work with PDFBox.
So now my code looks like:
PDDocument doc = PDDocument.load(this.bytes);
try {
    PDFRenderer pdfRenderer = new PDFRenderer(doc);
    // Render the first page at 300 DPI
    BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(0, 300);
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    ImageIO.write(bufferedImage, "tiff", os);
    os.flush();
    os.close();
    bufferedImage.flush();
} finally {
    doc.close();
}
I'm trying to run this code with an 800 KB PDF file, and when I check the memory after the
BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(0, 300);
line, usage rises to more than 500 MB! If I save this BufferedImage to disk, the output is about 1 MB in size... so when I try to run this code with 8 threads, I also get a Java heap space exception...
What am I missing here? Why does a 1 MB file result in a 500 MB image in memory? I tried playing with the DPI and reducing the quality, but the image is still very big... Is there any other library that can render PDF to TIFF, and that I could run with 10 threads without memory issues?
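For reference, my rough math on the uncompressed size of a single rendered page. The page dimensions and pixel format below are assumptions (A4-sized page, ARGB), not values read from the actual file:

```java
// Rough estimate of the in-memory size of one rendered page.
// Assumptions (not taken from the actual PDF): A4 page = 8.27 x 11.69 inches,
// 300 DPI, 4 bytes per pixel (an ARGB BufferedImage).
public class PageSizeEstimate {
    static long uncompressedBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel) {
        long widthPx = Math.round(widthInches * dpi);
        long heightPx = Math.round(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }

    public static void main(String[] args) {
        long bytes = uncompressedBytes(8.27, 11.69, 300, 4);
        // Tens of MB per page, before any TIFF compression.
        System.out.println(bytes / (1024 * 1024) + " MB per page");
    }
}
```

So a single page at 300 DPI can account for tens of MB uncompressed, even though the TIFF on disk compresses down to about 1 MB; with several pages or threads rendering at once, hundreds of MB seems plausible.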
Steps to reproduce:
Download the Linkedin CEO resume file from here - https://gofile.io/?c=TtA7XQ
I then used this code:
private static void test() throws IOException {
    printUsedMemory("App started...");
    File file = new File("linkedinceoresume.pdf");
    try (PDDocument doc = PDDocument.load(file)) {
        PDFRenderer pdfRenderer = new PDFRenderer(doc);
        printUsedMemory("Before");
        for (int page = 0; page < 1; ++page) {
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 76, ImageType.GRAY);
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            ImageIO.write(bufferedImage, "tiff", os);
            os.flush();
            os.close();
            bufferedImage.flush();
        }
    } finally {
        printUsedMemory("BufferedImage");
    }
}

private static void printUsedMemory(String text) {
    long freeMemory = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    long mb = freeMemory / 1000000;
    System.out.println(text + "....Used memory: " + mb + " MB");
}
and the output is:
App started.......Used memory: 42 MB
Before....Used memory: 107 MB
BufferedImage....Used memory: 171 MB
In this example it's not 500 MB, but for a 70 KB PDF, when I try to render only one page, the memory increases by about 70 MB... it's not proportional...
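My rough math for this reproduction case (again assuming an A4-sized page, which may differ from the actual file): at 76 DPI with 1 byte per pixel for grayscale, the raw image should be well under 1 MB, which is why the ~70 MB jump looks so disproportionate to me:

```java
// Rough estimate for the 76 DPI grayscale case.
// Assumptions (not taken from the actual PDF): A4 page = 8.27 x 11.69 inches,
// 76 DPI, 1 byte per pixel (ImageType.GRAY).
public class GrayPageEstimate {
    static long uncompressedBytes(double widthInches, double heightInches, int dpi, int bytesPerPixel) {
        long widthPx = Math.round(widthInches * dpi);
        long heightPx = Math.round(heightInches * dpi);
        return widthPx * heightPx * bytesPerPixel;
    }

    public static void main(String[] args) {
        long bytes = uncompressedBytes(8.27, 11.69, 76, 1);
        // Roughly half a MB of raw pixel data.
        System.out.println(bytes / 1024 + " KB");
    }
}
```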