Small PDF files result in a huge BufferedImage

Question

I'm trying to perform OCR on PDFs. There are two steps in the code:

1. Convert the PDF to TIFF files
2. Convert the TIFF files to text

I used ghost4j for the first step and tess4j for the second. Everything worked well until I started running it multi-threaded, at which point strange exceptions occurred. I read here: https://sourceforge.net/p/tess4j/discussion/1202293/thread/44cc65c5/ that ghost4j is not suitable for multi-threaded use, so I changed the first step to use PDFBox.

So now my code looks like:

PDDocument doc = PDDocument.load(this.bytes);
PDFRenderer pdfRenderer = new PDFRenderer(doc);
BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(0, 300);
ByteArrayOutputStream os = new ByteArrayOutputStream();
ImageIO.write(bufferedImage, "tiff", os);
os.flush();
os.close();
bufferedImage.flush();

I'm running this code on an 800 KB PDF file, and when I check the memory after this line:

BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(0, 300);

it rises to more than 500 MB! If I save this BufferedImage to disk, the output is only 1 MB. So when I run this code with 8 threads, I also get a Java heap space exception.

What am I missing here? Why does a 1 MB file result in a 500 MB image? I tried to play with the DPI and reduce the quality, but the image is still very big. Is there any other library that can render PDF to TIFF and that I could run with 10 threads without memory issues?

Steps to reproduce:

Download the LinkedIn CEO resume file from here - https://gofile.io/?c=TtA7XQ

I then used this code:

private static void test() throws IOException {
    printUsedMemory("App started...");
    File file = new File("linkedinceoresume.pdf");
    try (PDDocument doc = PDDocument.load(file)) {
        PDFRenderer pdfRenderer = new PDFRenderer(doc);
        printUsedMemory("Before");
        for (int page = 0; page < 1; ++page) {
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 76, ImageType.GRAY);
            ByteArrayOutputStream os = new ByteArrayOutputStream();
            ImageIO.write(bufferedImage, "tiff", os);
            os.flush();
            os.close();
            bufferedImage.flush();
        }
    } finally {
        printUsedMemory("BufferedImage");
    }
}

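private static void printUsedMemory(String text) {
    // Used heap = total heap currently reserved by the JVM minus the free part of it.
    // Note: this still counts garbage that has not been collected yet.
    long usedMemory = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    long mb = usedMemory / 1_000_000;
    System.out.println(text + "....Used memory: " + mb + " MB");
}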

and the output is:

App started.......Used memory: 42 MB

Before....Used memory: 107 MB

BufferedImage....Used memory: 171 MB

In this example it's not 500 MB, but for a PDF of 70 KB, rendering only one page increases the memory by about 70 MB. That's not proportional.

Please share the PDF file. Maybe it has a huge image dimension output size? – Tilman Hausherr, Jan 07 '20 at 11:31
Can you check the dimensions of your `BufferedImage` after rendering? – T A, Jan 07 '20 at 11:40
Unfortunately, I can't share the file. At 300 DPI the dimensions are 3300 x 2550; at 76 DPI they are 836 x 646, yet the memory usage is almost the same. I tested with 4-5 different PDF files, and all of them result in this huge size. – Lior Y, Jan 07 '20 at 12:03
It is probably not because of the conversion then. Maybe you have a memory leak elsewhere? – T A, Jan 07 '20 at 12:05
Note that high memory consumption doesn't necessarily indicate a memory leak. Perhaps the page contains a bitmap object that needs a lot of memory to decode? Does PDFBox subsample images when rendering at smaller sizes? If not, rendering at a small size may not help. – Harald K, Jan 07 '20 at 12:22
PDFBox does not subsample by default, but it can be enabled in PDFRenderer. – Tilman Hausherr, Jan 07 '20 at 12:26
Have you seen https://stackoverflow.com/questions/6437564/create-a-tiff-with-only-text-and-no-images-from-a-postscript-file-with-ghostscri ? – Tobias Otto, Jan 07 '20 at 14:21
@TilmanHausherr What are we supposed to do when we have a PDF that includes a picture with a big resolution? For testing purposes, I created a small PDF (1 MB) with only one page that includes an image with a huge resolution, and it consumed more than 4 GB to convert it into an image. Is there something we can do to avoid that? – Nicolas Filotto, Jan 13 '20 at 11:29
@NicolasFilotto Activate subsampling in PDFRenderer. But subsampling is probably not a good idea for OCR. – Tilman Hausherr, Jan 13 '20 at 11:41
@TilmanHausherr I confirm that it works well in my case, thank you for the tip. – Nicolas Filotto, Jan 13 '20 at 11:50
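
To make the term concrete: subsampling means decoding only every n-th pixel of an embedded image, so the full-resolution bitmap never has to sit in memory. The sketch below is not PDFBox code; it only illustrates the idea with the JDK's own ImageIO, and the class and method names (SubsamplingSketch, readSubsampled, encodePng) are made up for this example. In PDFBox itself, the switch mentioned in the comments is, as far as I know, PDFRenderer.setSubsamplingAllowed(true).

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReadParam;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Iterator;

// Hypothetical helper, JDK only: illustrates subsampled decoding, not PDFBox internals.
final class SubsamplingSketch {
    private SubsamplingSketch() {}

    /** Decodes an encoded image, keeping only every factor-th pixel in both directions. */
    static BufferedImage readSubsampled(byte[] encoded, int factor) throws IOException {
        try (ImageInputStream in = new MemoryCacheImageInputStream(new ByteArrayInputStream(encoded))) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageIO reader for this format");
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                ImageReadParam param = reader.getDefaultReadParam();
                // Same factor horizontally and vertically, no offset.
                param.setSourceSubsampling(factor, factor, 0, 0);
                return reader.read(0, param);
            } finally {
                reader.dispose();
            }
        }
    }

    /** Encodes an image as PNG in memory (used here only to produce test input). */
    static byte[] encodePng(BufferedImage image) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            ImageIO.write(image, "png", out);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        BufferedImage full = new BufferedImage(400, 300, BufferedImage.TYPE_INT_RGB);
        BufferedImage small = readSubsampled(encodePng(full), 2);
        System.out.println(small.getWidth() + " x " + small.getHeight());
    }
}
```

The decoded image has a quarter of the pixels for a factor of 2, which is also why the comments warn that subsampling can hurt OCR accuracy.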

score 0 · Answer 1 · answered Jan 13 '20 at 11:31

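An image of 3300 x 2550 pixels at one byte per pixel takes about 8_415_000 bytes, and roughly four times that (about 34 MB) for a 4-byte-per-pixel type such as BufferedImage.TYPE_INT_RGB. At 150 dpi those dimensions would mean 22 inch by 17 inch, which is far too large; at the 300 dpi used in the question they correspond to 11 inch by 8.5 inch.

So scale the picture down and use a smaller image type to reduce memory: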

    float scale = 0.5f;
    BufferedImage bufferedImage = pdfRenderer.renderImage(page, scale, ImageType.BINARY);

Save it as PNG rather than TIFF to see whether that makes a difference.
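
To check the raster arithmetic for a given size and image type, here is a minimal sketch using only the JDK. It measures the pixel data buffer of a BufferedImage and ignores object overhead and anything PDFBox allocates while rendering. The class and method names (RasterMemory, rasterBytes) are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBuffer;

// Hypothetical helper: reports how many bytes the pixel data of an image occupies.
final class RasterMemory {
    private RasterMemory() {}

    /** Size in bytes of the image's backing data buffer (pixel data only). */
    static long rasterBytes(BufferedImage image) {
        DataBuffer buffer = image.getRaster().getDataBuffer();
        // getDataTypeSize returns bits per array element (8 for byte, 32 for int).
        long bytesPerElement = DataBuffer.getDataTypeSize(buffer.getDataType()) / 8;
        // getSize is the element count per bank; standard images have one bank.
        return (long) buffer.getSize() * buffer.getNumBanks() * bytesPerElement;
    }

    public static void main(String[] args) {
        int width = 3300;
        int height = 2550;
        int[] types = {
            BufferedImage.TYPE_BYTE_BINARY, // 1 bit per pixel, packed
            BufferedImage.TYPE_BYTE_GRAY,   // 1 byte per pixel
            BufferedImage.TYPE_INT_RGB      // 4 bytes per pixel
        };
        String[] names = {"BINARY", "GRAY", "RGB"};
        for (int i = 0; i < types.length; i++) {
            long bytes = rasterBytes(new BufferedImage(width, height, types[i]));
            System.out.println(names[i] + ": " + bytes + " bytes");
        }
    }
}
```

Even the 4-byte-per-pixel case stays far below 500 MB for a single page, so the raster of the rendered page alone does not explain the numbers in the question.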

Joop Eggen

The OP wants to do OCR, so 300 dpi is a good choice. But you're right on the image type; I have made the same suggestion in PDFBOX-4739. (It also turned out that the images are saved uncompressed.) – Tilman Hausherr Jan 13 '20 at 11:39
@TilmanHausherr I partly do OCR with 150 dpi successfully, but indeed 300 dpi is the norm. Using a ByteArrayOutputStream as above might be costly too. – Joop Eggen Jan 13 '20 at 12:12

Tilman Hausherr · Answer 2 · 2020-01-13T18:57:56.860

The issue was solved in the discussion in PDFBOX-4739:

1. Use ImageIOUtil.writeImage() instead of ImageIO.write() (you will need the tools subproject), because ImageIO.write() doesn't compress TIFF files. ImageIOUtil tries to use LZW or CCITT compression, depending on the source image.
2. Don't save the image at all: tess4j has a doOCR() method that takes a BufferedImage as parameter, so there is no need to save anything.
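
The effect of the first point can be seen without PDFBox. The sketch below is an assumption-laden stand-in, not the ImageIOUtil implementation: it uses the TIFF writer bundled with the JDK (Java 9 or later) and requests LZW explicitly through ImageWriteParam, then compares the result with the default, uncompressed output. The class and method names (TiffCompressionSketch, writeTiff) are invented for this example.

```java
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Iterator;

// Hypothetical helper, JDK 9+ only: writes a TIFF with or without explicit compression.
final class TiffCompressionSketch {
    private TiffCompressionSketch() {}

    /**
     * Encodes the image as TIFF in memory.
     * compressionType: e.g. "LZW", or null to keep the writer's default settings.
     */
    static byte[] writeTiff(BufferedImage image, String compressionType) throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
        if (!writers.hasNext()) {
            throw new IOException("No TIFF writer available (the bundled one needs Java 9+)");
        }
        ImageWriter writer = writers.next();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            ImageWriteParam param = writer.getDefaultWriteParam();
            if (compressionType != null) {
                // Compression must be switched to explicit mode before a type can be set.
                param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
                param.setCompressionType(compressionType);
            }
            writer.write(null, new IIOImage(image, null, null), param);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        BufferedImage page = new BufferedImage(800, 600, BufferedImage.TYPE_BYTE_GRAY);
        System.out.println("default: " + writeTiff(page, null).length + " bytes");
        System.out.println("LZW:     " + writeTiff(page, "LZW").length + " bytes");
    }
}
```

For OCR with tess4j, the second point avoids this step entirely, since the rendered BufferedImage can be passed on directly.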

edited Jan 13 '20 at 18:57

answered Jan 13 '20 at 17:32
