
I'm trying to perform OCR on pdfs. There are 2 steps in the code:

  1. Convert pdf to tiff files
  2. Convert tiff to text

I used ghost4j for the first step and tess4j for the second. Everything worked great until I started running it multi-threaded, and then strange exceptions occurred. I read here: https://sourceforge.net/p/tess4j/discussion/1202293/thread/44cc65c5/ that ghost4j is not suitable for multi-threaded use, so I changed the first step to use PDFBox.

So now my code looks like:

    PDDocument doc = PDDocument.load(this.bytes);
    PDFRenderer pdfRenderer = new PDFRenderer(doc);
    BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(0, 300);
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    ImageIO.write(bufferedImage, "tiff", os);
    os.flush();
    os.close();
    bufferedImage.flush();

I'm running this code on an 800 KB PDF file, and when I check the memory after the line

    BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(0, 300);

it rises to more than 500 MB! If I save this BufferedImage to disk, the output is only 1 MB. So when I run this code with 8 threads, I also get the Java heap size exception.

What am I missing here? Why does a 1 MB file result in a 500 MB image? I tried playing with the DPI and reducing the quality, but the image is still very big. Is there any other library that can render PDF to TIFF and that I could run with 10 threads without memory issues?

Steps to reproduce:

  1. Download the Linkedin CEO resume file from here - https://gofile.io/?c=TtA7XQ

  2. I then used this code:

    private static void test() throws IOException {
        printUsedMemory("App started...");
        File file = new File("linkedinceoresume.pdf");
        try (PDDocument doc = PDDocument.load(file)) {
            PDFRenderer pdfRenderer = new PDFRenderer(doc);
            printUsedMemory("Before");
            for (int page = 0; page < 1; ++page) {
                BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 76, ImageType.GRAY);
                ByteArrayOutputStream os = new ByteArrayOutputStream();
                ImageIO.write(bufferedImage, "tiff", os);
                os.flush();
                os.close();
                bufferedImage.flush();
            }
        } finally {
            printUsedMemory("BufferedImage");
        }
    }
    
    private static void printUsedMemory(String text) {
        // totalMemory() - freeMemory() is the heap currently in use, including garbage not yet collected
        long usedMemory = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
        long mb = usedMemory / 1000000;
        System.out.println(text + "....Used memory: " + mb + " MB");
    }
    

and the output is:

App started.......Used memory: 42 MB

Before....Used memory: 107 MB

BufferedImage....Used memory: 171 MB

In this example it's not 500 MB, but for a PDF of 70 KB, rendering only one page increases the memory by about 70 MB. It's not proportional.

Lior Y
  • Please share the PDF file. Maybe it has a huge image dimension output size? – Tilman Hausherr Jan 07 '20 at 11:31
  • Can you check the dimensions of your `BufferedImage` after rendering? – T A Jan 07 '20 at 11:40
  • Unfortunately, I can't share the file. With DPI 300 the dimensions are 3300 X 2550; with DPI 76 they are 836 X 646, and the memory size is almost the same. I tested with 4-5 different PDF files, and all of them result in this huge size... – Lior Y Jan 07 '20 at 12:03
  • It is probably not because of the conversion then. Maybe you have a memory leak elsewhere? – T A Jan 07 '20 at 12:05
  • Note that high memory consumption doesn't necessarily indicate a memory leak. Perhaps the page contains a bitmap object that needs a lot of memory to decode? Does PDFBox subsample images when rendering at smaller sizes? If not, rendering at a small size may not help... – Harald K Jan 07 '20 at 12:22
  • PDFBox does not subsample by default, but it can be enabled in PDFRenderer. – Tilman Hausherr Jan 07 '20 at 12:26
  • I edited the main post with some info on how to reproduce – Lior Y Jan 07 '20 at 12:29
  • Have you seen https://stackoverflow.com/questions/6437564/create-a-tiff-with-only-text-and-no-images-from-a-postscript-file-with-ghostscri ? – Tobias Otto Jan 07 '20 at 14:21
  • @TobiasOtto not sure how that post can help... – Lior Y Jan 08 '20 at 08:56
  • @TilmanHausherr What are we supposed to do when we have a PDF that includes a picture with a big resolution? For testing purposes, I created a small PDF (1 MB) with only one page that includes an image with a huge resolution, and it consumed more than 4 GB to convert it into an image. Is there something we can do to avoid that? – Nicolas Filotto Jan 13 '20 at 11:29
  • @NicolasFilotto activate subsampling in PDFRenderer. But subsampling is probably not a good idea for OCR. – Tilman Hausherr Jan 13 '20 at 11:41
  • @TilmanHausherr I confirm that it works well in my case, thank you for the tip – Nicolas Filotto Jan 13 '20 at 11:50

2 Answers

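An image of 3300 X 2550 pixels takes about 8.4 MB at one byte per pixel (3300 × 2550 = 8,415,000 bytes), and about 34 MB with the default RGB image type, which uses four bytes per pixel. At 300 dpi that is a page of 11 inch by 8.5 inch.

So scale the picture down and render it as a binary image (one bit per pixel) to reduce the memory: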

    float scale = 0.5f;
    BufferedImage bufferedImage = pdfRenderer.renderImage(page, scale, ImageType.BINARY);

Save it as png rather than tiff to see whether that makes a difference.
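
The arithmetic behind these sizes can be sketched in a few lines. This is a rough estimate only, assuming the raster is stored as packed rows padded to a whole byte; the names `RasterMemorySketch`, `pixels` and `estimateBytes` are made up for illustration and are not part of PDFBox. The bits-per-pixel values correspond to the `BufferedImage` types behind `ImageType.RGB` (32), `ImageType.GRAY` (8) and `ImageType.BINARY` (1).

```java
final class RasterMemorySketch {

    /** Pixels along one side of a page: PDF user space has 72 points per inch. */
    static long pixels(double points, double dpi) {
        return Math.round(points * dpi / 72.0);
    }

    /** Approximate raster size in bytes; each row is rounded up to a whole byte. */
    static long estimateBytes(long width, long height, int bitsPerPixel) {
        long bytesPerRow = (width * bitsPerPixel + 7) / 8;
        return bytesPerRow * height;
    }

    public static void main(String[] args) {
        // A US Letter page is 612 x 792 points (assumed page size for this example).
        long w = pixels(792, 300); // 3300
        long h = pixels(612, 300); // 2550
        System.out.println(w + " x " + h + " pixels at 300 DPI");
        System.out.println("RGB    (32 bpp): " + estimateBytes(w, h, 32) + " bytes"); // 33660000
        System.out.println("GRAY   (8 bpp):  " + estimateBytes(w, h, 8) + " bytes");  // 8415000
        System.out.println("BINARY (1 bpp):  " + estimateBytes(w, h, 1) + " bytes");  // 1053150
    }
}
```

Even the largest of these figures is well below the 500 MB reported in the question, so the output raster alone probably does not explain it; the comments under the question suggest large embedded images decoded during rendering as one possible cause.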

Joop Eggen
  • The OP wants to do OCR, so 300dpi is a good choice. But you're right on the image type, I have made the same suggestion in PDFBOX-4739. (It also turned out that the images are saved uncompressed) – Tilman Hausherr Jan 13 '20 at 11:39
  • @TilmanHausherr I partly do OCR with 150 dpi successfully but indeed 300 dpi is the norm. Using a ByteArrayOutputStream as above might be costly too. – Joop Eggen Jan 13 '20 at 12:12

The issue was solved in the discussion in PDFBOX-4739:

  • use ImageIOUtil.writeImage() instead of ImageIO.write() (you will need the tools subproject), because ImageIO doesn't compress TIFF files by default. ImageIOUtil tries to use LZW or CCITT, depending on the source image.
  • don't save the image at all: there is a doOCR() method that takes a BufferedImage as parameter, so no need to save at all.
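
The first point can be illustrated without PDFBox. The sketch below uses only the JDK's own TIFF writer (assumed to be present, which is the case from Java 9 on) to write the same image once with the default parameters and once with explicit LZW compression. It is not the implementation used by the PDFBox tools subproject, and `TiffCompressionSketch` and `writeTiff` are names invented for this example.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import javax.imageio.stream.MemoryCacheImageOutputStream;

final class TiffCompressionSketch {

    /** Encodes the image as TIFF; a null compressionType keeps the writer's defaults. */
    static byte[] writeTiff(BufferedImage image, String compressionType) throws IOException {
        // Throws NoSuchElementException if no TIFF writer is installed (e.g. plain Java 8).
        ImageWriter writer = ImageIO.getImageWritersByFormatName("tiff").next();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = new MemoryCacheImageOutputStream(bytes)) {
            writer.setOutput(out);
            ImageWriteParam param = writer.getDefaultWriteParam();
            if (compressionType != null) {
                param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
                param.setCompressionType(compressionType);
            }
            writer.write(null, new IIOImage(image, null, null), param);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // A blank white grayscale page with the 76 DPI dimensions from the question.
        BufferedImage page = new BufferedImage(836, 646, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = page.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, page.getWidth(), page.getHeight());
        g.dispose();

        System.out.println("default: " + writeTiff(page, null).length + " bytes");
        System.out.println("LZW:     " + writeTiff(page, "LZW").length + " bytes");
    }
}
```

On a mostly uniform page the LZW output should be much smaller than the default one; the exact sizes depend on the image content.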
Tilman Hausherr