I am trying to extract images from a PDF file with iText. I found an example on the web that worked fine:

    File file = new File("example.pdf");
    PdfReader reader = new PdfReader(file.getAbsolutePath());
    // Walk every object in the cross-reference table (object-number order, not page order)
    for (int i = 0; i < reader.getXrefSize(); i++) {
        PdfObject pdfobj = reader.getPdfObject(i);
        if (pdfobj == null || !pdfobj.isStream()) {
            continue;
        }
        PdfStream stream = (PdfStream) pdfobj;
        PdfObject pdfsubtype = stream.get(PdfName.SUBTYPE);
        if (PdfName.IMAGE.equals(pdfsubtype)) {
            // Bytes exactly as stored in the PDF (not decoded)
            byte[] img = PdfReader.getStreamBytesRaw((PRStream) stream);
            File target = new File(file.getParentFile(), String.format("%05d", i) + ".jpg");
            // try-with-resources closes the stream even if the write fails
            try (FileOutputStream out = new FileOutputStream(target)) {
                out.write(img);
            }
        }
    }
    reader.close();
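
One caveat about the example above: `getStreamBytesRaw` returns the bytes as stored in the PDF, so the result is a usable `.jpg` only when the image stream happens to be JPEG-encoded (the DCTDecode filter). A minimal sketch, using only the Java standard library, of checking the raw bytes before choosing a file extension. The class name, the method names, and the `.bin` fallback are hypothetical and not part of iText:

```java
public class RawImageKind {

    /** Returns true if the bytes start with the JPEG start-of-image marker (FF D8 FF). */
    public static boolean looksLikeJpeg(byte[] data) {
        return data != null && data.length >= 3
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8
                && (data[2] & 0xFF) == 0xFF;
    }

    /** Picks an extension for raw stream bytes; ".bin" marks data that is not a standalone JPEG. */
    public static String extensionFor(byte[] data) {
        return looksLikeJpeg(data) ? ".jpg" : ".bin";
    }

    public static void main(String[] args) {
        byte[] jpegHeader = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0};
        byte[] otherData = {0x78, (byte) 0x9C, 0x01};
        System.out.println(extensionFor(jpegHeader)); // prints .jpg
        System.out.println(extensionFor(otherData));  // prints .bin
    }
}
```

In the loop above, `extensionFor(img)` could replace the hard-coded `".jpg"`, so that streams needing further decoding are at least easy to spot.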

That gave me all the images, but the images were in the wrong order. My next attempt looked like this:

    // iText page numbers start at 1, not 0
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        PdfDictionary d = reader.getPageN(i);
        PdfIndirectReference ir = d.getAsIndirectObject(PdfName.CONTENTS);
        PdfObject o = reader.getPdfObject(ir.getNumber());
        PdfStream stream = (PdfStream) o;
        // rest from example above
    }

Although `o.isStream() == true`, the stream dictionary contains only /Length and /Filter, and the stream is only about 100 bytes long. There is no image to be found in it at all.
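
A likely reason for the tiny stream: a page's /Contents stream normally holds only drawing operators, and an image is painted by a `/Name Do` operator that refers to an entry in the page's /Resources /XObject dictionary. The order of those operators is the order in which the images are drawn. A minimal sketch, using only the Java standard library, that lists the referenced names in paint order from an already decoded content stream. The class name, method name, and sample stream are hypothetical; the regex is a simplification that ignores inline images and does not check whether a name refers to an image or a form XObject:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentStreamImageOrder {

    // Matches a PDF name followed by the Do operator, e.g. "/Im2 Do"
    private static final Pattern DO_OPERATOR =
            Pattern.compile("/([^\\s/\\[\\]<>(){}%]+)\\s+Do\\b");

    /** Returns the XObject names in the order in which the content stream paints them. */
    public static List<String> xObjectNamesInPaintOrder(String decodedContent) {
        List<String> names = new ArrayList<>();
        Matcher matcher = DO_OPERATOR.matcher(decodedContent);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical decoded content stream that paints two XObjects
        String sample = "q 200 0 0 100 50 600 cm /Im2 Do Q\n"
                + "q 200 0 0 100 50 400 cm /Im1 Do Q\n";
        System.out.println(xObjectNamesInPaintOrder(sample)); // prints [Im2, Im1]
    }
}
```

Each name would then have to be looked up in the page's /XObject resources and checked for /Subtype /Image before its stream is extracted.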

What is the correct way to get all the images from a PDF file in the correct order?

nratx

1 Answer

I found an answer elsewhere, namely the iText mailing list.

The following code works for me. Please note that I switched from iText to Apache PDFBox:

    PDDocument document = PDDocument.load(inFile);
    try {
        // Visit the pages in document order
        List<?> pages = document.getDocumentCatalog().getAllPages();
        for (Object p : pages) {
            PDPage page = (PDPage) p;
            PDResources resources = page.getResources();
            // Image XObjects declared in this page's resources, keyed by resource name
            Map<?, ?> pageImages = resources.getImages();
            if (pageImages != null) {
                for (Object value : pageImages.values()) {
                    PDXObjectImage image = (PDXObjectImage) value;
                    image.write2OutputStream(/* some output stream */);
                }
            }
        }
    } finally {
        document.close();
    }
Lonzak
nratx
  • Is PDXObjectImage part of iText too? I can't seem to find it. – Filipe Correia May 23 '12 at 16:15
  • @FilipeCorreia nratx forgot to mention that he switched to Apache PDFBox. – matt Nov 21 '12 at 09:36
  • For some PDF files the line `PDResources resources = page.getResources();` will need to be replaced with `PDResources resources = page.findResources();` – Tim Aug 01 '13 at 01:31
  • This code still extracts images in the wrong order (tested with PDFBox 1.6 and 1.8) – Roman Malieiev Aug 06 '15 at 11:03
  • This [answer](http://stackoverflow.com/questions/14120748/how-can-extract-images-from-pdf-file-using-itext-library-in-my-android-applicati) worked for me with iText, preserving the order and extracting both JPG and PNG images – Roman Malieiev Aug 10 '15 at 13:28
  • I couldn't find the methods getImages and getResources – Bassel Kh Jun 13 '17 at 21:15
  • Why is this answer accepted? It is completely irrelevant to the question, since it is for a different library. If I ask "how do I get from A to B via train" and someone replies with a list of flights from A to B, that's not a valid answer. I get that it's how you solved your functional problem, but it's not what the original question was about, and answers like this make it more difficult to navigate SO. – Demonblack Aug 07 '19 at 10:05
  • The answer does not match the question. Please update the question so that people looking for an iText answer do not end up at a PDFBox answer, or consider deleting your question. – David Yates Dec 26 '19 at 20:50