Convert PDF pages to byte array using iText

Question

My Question

I'm looking for a way to convert the individual PDF pages into a byte[] (as in one byte[] per PDF page) so that I can then cast them to BufferedImage[].

This way, all of the conversion is done in memory instead of through temporary files, which is faster and less messy. I may use the byte arrays for service calls later on as well. I would prefer to use only iText, but if there is no other way, I'm open to other libraries.

What I have now

This is the code that I currently have:

public static BufferedImage toBufferedImage(byte[] input) throws IOException {
    InputStream in = new ByteArrayInputStream(input);
    BufferedImage bimg = ImageIO.read(in);
    return bimg;
}

public static BufferedImage[] extract(final String fileName) throws IOException {
    PdfReader reader = new PdfReader(fileName);
    int pageNum = reader.getNumberOfPages();
    BufferedImage[] imgArray = new BufferedImage[pageNum];

    for (int page = 0; page < pageNum; page++) {
        // iText page numbers start at 1, so ask for page + 1 (not pageNum).
        // TODO: You may need to decode the byte array first?
        imgArray[page] = toBufferedImage(reader.getPageContent(page + 1));
    }

    reader.close();
    return imgArray;
}

public static void convert() throws IOException {
    String fileName = getProps("file_in");
    BufferedImage[] bim = extract(fileName);
    // Nothing to close here; extract() closes the PdfReader itself.
}

And here's a (non-representative) list of the links that I've checked out so far.

Useful, but not quite what I want

Uses a different library

First, by "converting a PDF to an image" do you really mean "extract existing images from a PDF"? Looking at your code it appears to be about extracting and not converting. — Chris Haas, Jun 17 '16 at 16:06
@ChrisHaas Well the goal is to convert it. Right now, the thing that itext is doing (as far as I can tell) is making each page in the pdf a separate jpg file. Then each jpg file is merged into a multipage tiff. I want to stop making local temporary files, and do this `pdf -> byte[] -> BufferedImage -> MultipageTiff` all in memory — Scrambo, Jun 17 '16 at 17:11
Just like @Chris said, your question as a whole is not clear, it's partially about rendering pages as images and partially about extracting bitmap images from the pages. Itext does not (yet) include an image rendering API but it does have a bitmap extraction API. — mkl, Jun 17 '16 at 17:15
`pdf -> byte[] -> BufferedImage -> MultipageTiff` - what do you expect that `byte[]` to contain? — mkl, Jun 17 '16 at 17:16
I've slimmed down the question to make it more clear hopefully. @mkl So if I understand what you're saying in your first comment, itext doesn't make images, it extracts them? That's what I've For your second comment, I may have been a little unclear. I want to try to extract each page in the pdf to a separate byte[], not the whole pdf to a single byte[]. — Scrambo, Jun 17 '16 at 17:46
*I want to try to extract each page in the pdf to a seperate byte[]* - yes, you said so, but as what do you expect the page to be represented in that `byte[]`? You say you want to *cast them to BufferedImage[].* Obviously you cannot *cast*. You seem to look for code that draws the page as a bitmap image. Itext cannot do that yet out of the box. — mkl, Jun 17 '16 at 20:54
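
The point raised in these comments can be checked without iText. A page's content (what getPageContent() hands back) is a stream of drawing operators, not an encoded picture, so ImageIO has nothing it can decode. The sketch below is only an illustration: the class name, the helper names, and the sample operator string are made up, and the string merely stands in for the kind of bytes a content stream holds.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ContentStreamIsNotAnImage {

    // Hypothetical helper: tries to decode arbitrary bytes as an image.
    // ImageIO.read returns null when no registered reader recognises the data.
    static BufferedImage tryDecode(byte[] bytes) {
        try {
            return ImageIO.read(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Made-up stand-in for page content: PDF drawing operators as plain text.
    static byte[] sampleContentStream() {
        return "q 1 0 0 1 72 720 cm BT /F1 12 Tf (Hello) Tj ET Q"
                .getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        BufferedImage img = tryDecode(sampleContentStream());
        System.out.println(img == null ? "not an image" : "decoded an image");
    }
}
```

Bytes like these have to be rendered by something that interprets the operators before a BufferedImage can exist, which is a different job from decoding an image file.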

Scrambo · Accepted Answer · 2016-06-20T17:22:45.363

I did some digging and came up with a solution! Hopefully someone else finds this when they need it, and that it helps as much as possible. Cheers!

Extending the RenderListener Class

I looked around and found this. Looking through the code and classes, I found that PdfImageObjects have a getBufferedImage() which is exactly what I was looking for. Now there's no need to convert to a byte[], which is what I originally thought I was going to have to do. Using the given example code, I came up with this class:

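import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.util.ArrayList;

public class MyImageRenderListener implements RenderListener {

    protected String path = "";
    protected ArrayList<BufferedImage> bimg = new ArrayList<>();

    /**
     * Creates a RenderListener that will look for images.
     */
    public MyImageRenderListener(String path) {
        this.path = path;
    }

    public ArrayList<BufferedImage> getBimgArray() {
        return bimg;
    }

    // The text callbacks are required by the interface but are not needed here.
    public void beginTextBlock() { }

    public void renderText(TextRenderInfo renderInfo) { }

    public void endTextBlock() { }

    /**
     * @see com.itextpdf.text.pdf.parser.RenderListener#renderImage(
     * com.itextpdf.text.pdf.parser.ImageRenderInfo)
     */
    public void renderImage(ImageRenderInfo renderInfo) {
        try {
            PdfImageObject image = renderInfo.getImage();

            if (image == null) {
                return;
            }
            // Collect the decoded image in memory instead of writing it to disk.
            bimg.add(image.getBufferedImage());

        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}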

Important changes to notice here compared to the link above are the additions of a new field ArrayList<BufferedImage> bimg, a getter for that field, and a restructuring of the renderImage() function.

I also changed some of the methods in the other class of my project:

Code for Bursting PDF to BufferedImage[]

// Credit to Mihai. Code found here: http://stackoverflow.com/questions/6851385/save-tiff-ccittfaxdecode-from-pdf-page-using-itext-and-java
public static ArrayList<BufferedImage> getBufImgArr(final String BasePath) throws IOException {

    PdfReader reader = new PdfReader(BasePath);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    MyImageRenderListener listener = new MyImageRenderListener(BasePath + "extract/image%s.%s");

    for (int page = 1; page <= reader.getNumberOfPages(); page++) {
        parser.processContent(page, listener);
    }

    reader.close();
    return listener.getBimgArray();

}
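
If a byte[] per image is still wanted, for example for the service calls mentioned in the question, a BufferedImage can be encoded and decoded entirely in memory with the standard javax.imageio API. This is a minimal sketch rather than part of the original solution: the class and method names are made up, and PNG is an arbitrary choice of format.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class InMemoryImageBytes {

    // Encodes an image as PNG bytes without touching the file system.
    static byte[] toPngBytes(BufferedImage image) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (!ImageIO.write(image, "png", out)) {
                throw new IllegalStateException("no PNG writer for this image type");
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes bytes in an image file format back into a BufferedImage.
    static BufferedImage fromBytes(byte[] bytes) {
        try {
            return ImageIO.read(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Ask ImageIO to cache its streams in memory rather than in disk files.
        ImageIO.setUseCache(false);
        BufferedImage page = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        byte[] bytes = toPngBytes(page);
        BufferedImage back = fromBytes(bytes);
        System.out.println(bytes.length + " bytes, " + back.getWidth() + "x" + back.getHeight());
    }
}
```

Each element of the list returned by getBufImgArr() could be passed through toPngBytes() in the same way.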

Code for Converting BufferedImage[] to Multi-Page Tiff

public static void convert(String fin) throws FileNotFoundException, IOException {

    ArrayList<BufferedImage> bimgArrL = getBufImgArr(fin);
    BufferedImage[] bim = new BufferedImage[bimgArrL.size()];
    bimgArrL.toArray(bim);

    try (RandomAccessOutputStream rout = new FileCacheRandomAccessOutputStream(
        new FileOutputStream("/path/you/want/result/to/go.tiff"))) {

        // The options for the TIFF file are set here.
        // **THIS BLOCK USES THE ICAFE LIBRARY TO CONVERT TO MULTIPAGE-TIFF**
        // ICAFE: https://github.com/dragon66/icafe
        // I found this block of code here: https://github.com/dragon66/icafe/wiki
        // (about 3/4 of the way down the page).
        ImageParam.ImageParamBuilder builder = ImageParam.getBuilder();
        TIFFOptions tiffOptions = new TIFFOptions();
        tiffOptions.setApplyPredictor(true);
        tiffOptions.setTiffCompression(Compression.CCITTFAX4);
        tiffOptions.setDeflateCompressionLevel(0);
        builder.imageOptions(tiffOptions);
        // Note: as written, the builder is never passed to the call below,
        // which only receives the output stream and the images.
        TIFFTweaker.writeMultipageTIFF(rout, bim);

    }
}

To kick off the whole process:

public static void main(String[] args) throws IOException {
    // convert() throws a checked IOException, so main has to declare it.
    convert("/path/to/pdf/image.pdf");
}

IMPORTANT TO NOTE:

You may notice that listener.renderImage() is never explicitly called in my code. renderImage() is a callback: when the listener object is passed to parser.processContent() in the getBufImgArr(param) method, the parser calls it for each image it encounters on the page.
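
The callback arrangement can be sketched without iText at all. Everything in the block below is a hypothetical stand-in (ImageListener, CollectingListener and ToyParser are not iText classes); it only mimics the shape of a parser that is handed a listener and calls back into it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a render listener interface.
interface ImageListener {
    void renderImage(String imageName);
}

// Collects whatever the parser reports, like the bimg list in the answer.
class CollectingListener implements ImageListener {
    final List<String> seen = new ArrayList<>();

    @Override
    public void renderImage(String imageName) {
        seen.add(imageName);
    }
}

// Hypothetical stand-in for a content parser.
class ToyParser {
    // Walks the page items and calls the listener for every image it meets.
    static void processContent(List<String> pageItems, ImageListener listener) {
        for (String item : pageItems) {
            if (item.startsWith("image:")) {
                listener.renderImage(item.substring("image:".length()));
            }
        }
    }
}

class CallbackDemo {
    public static void main(String[] args) {
        CollectingListener listener = new CollectingListener();
        // The caller never invokes renderImage() itself; the parser does.
        ToyParser.processContent(Arrays.asList("text:hello", "image:scan1", "image:logo"), listener);
        System.out.println(listener.seen); // prints [scan1, logo]
    }
}
```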

As @mkl has noted in the comments below, the code extracts all of the images on each PDF page, since a PDF page isn't an image in and of itself. Problems may occur if you run this code on PDFs that were scanned and run through OCR, or on PDFs that have multiple layers. In that scenario, multiple images from a single PDF page would be converted into multiple TIFF pages, when you (may) want them to stay together on a single page.
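
One possible workaround for that case, sketched here under loud assumptions, is to paint all of the images collected for one page onto a single canvas before writing the TIFF. The class and method names are made up, and stacking the images top to bottom is a simplification: the extraction code above does not record where each image sat on the page, so the real layout is not reproduced.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.List;

class PageImageStacker {

    // Draws the images top to bottom on one white canvas (assumed layout).
    static BufferedImage stackVertically(List<BufferedImage> images) {
        if (images.isEmpty()) {
            throw new IllegalArgumentException("need at least one image");
        }
        int width = 0;
        int height = 0;
        for (BufferedImage img : images) {
            width = Math.max(width, img.getWidth());
            height += img.getHeight();
        }
        BufferedImage canvas = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = canvas.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            int y = 0;
            for (BufferedImage img : images) {
                g.drawImage(img, 0, y, null);
                y += img.getHeight();
            }
        } finally {
            g.dispose();
        }
        return canvas;
    }

    public static void main(String[] args) {
        BufferedImage a = new BufferedImage(10, 5, BufferedImage.TYPE_INT_RGB);
        BufferedImage b = new BufferedImage(6, 7, BufferedImage.TYPE_INT_RGB);
        BufferedImage page = stackVertically(Arrays.asList(a, b));
        System.out.println(page.getWidth() + "x" + page.getHeight()); // prints 10x12
    }
}
```

Using this would also require collecting the images per page rather than in one flat list, for example by creating a fresh listener for each page in the loop.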

Good sources I found:

Programcreek search for PdfReaderContentParser

In contrast to your question your code **(A)** in general does not render the whole page but instead only extracts the bitmap images from the page --- in case of scanned PDFs these notions may coincide, though --- and **B** there are no `byte` arrays visible at all here. If you had asked for a way to extract embedded bitmap images from a PDF from the beginning, you'd have had an answer very quickly. — mkl, Jun 20 '16 at 13:58
It seems I didn't quite get what pdf's were / how they stored data inside. It makes more sense now to think of these certain pdf files as images inside a container. As for **B**, that seems to be a result of me not fully understanding the problem scope, and what was actually needed. Either way, I appreciate you taking the time to explain and help me out @mkl — Scrambo, Jun 20 '16 at 14:36