How to Extract Images from a PDF Form with iText

Question

This article (How to extract images from a PDF with iText in the correct order?) explains how to pull images from a regular PDF file. I need to extract an image that a user has entered into a PDF form field.

I use iText 7. I can access the form fields in iText with code like this:

PdfReader reader = new PdfReader(new FileInputStream(new ClassPathResource("myFile.pdf").getFile()));
PdfDocument document = new PdfDocument(reader);
PdfAcroForm acroForm = PdfAcroForm.getAcroForm(document, false);
Map<String, PdfFormField> fields = acroForm.getFormFields();
PdfButtonFormField imageField = null;
PdfDictionary dictionary = null;
for (String fldName : fields.keySet()) {
    if ("Image1_af_image".equals(fldName)) {
        imageField = (PdfButtonFormField) fields.get(fldName);
        dictionary = imageField.getPdfObject();
    }
}

where Image1_af_image is the default name of an image field in the form. Is it possible to extract an image stream from the PdfButtonFormField or its associated dictionary object?

Thank you for your very helpful response. I have incorporated your code as follows:

    public void iTextTest3() throws IOException {

        PdfReader reader = new PdfReader(new FileInputStream(new ClassPathResource("templates/TestForm.pdf").getFile()));

        PdfDocument document = new PdfDocument(reader);
        String fieldname = "Image1_af_image";
        PdfAcroForm acroForm = PdfAcroForm.getAcroForm(document, false);

        PdfFormField imagefield = acroForm.getField(fieldname);
        // get the appearance dictionary
        PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
        // get the xobject resources
        PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
        for (PdfName key : xObjDic.keySet()) {
            System.out.println(key);
            PdfStream s = xObjDic.getAsStream(key);
            // only process images
            if (PdfName.Image.equals(s.getAsName(PdfName.Subtype))) {  //*** code fails here ***
                PdfImageXObject pixo = new PdfImageXObject(s);
                byte[] imgbytes = pixo.getImageBytes();
                String ext = pixo.identifyImageFileExtension();

                // write the image to file
                String fileName = null;
                FileOutputStream fos = new FileOutputStream(fileName = key.toString().substring(1) + "." + ext);
                System.out.println(("image fileName: " + fileName));
                fos.write(imgbytes);
                fos.close();
            }
        }
        document.close();
    }

The code fails because s.getAsName(PdfName.Subtype) returns the value "Form". I'm guessing that what I need to do is recurse into the XObject tree as you suggest in your post but am not sure just how to do that. I tried xObjDic.getAsDictionary() but am not sure what PdfName to pass in as an argument.

You mention iText without clarifying the version. According to your code I assume it's an iText 7 version? — mkl, Apr 29 '21 at 16:20
@StephenSchultz I adapted my answer, based on your question edit. — rhens, May 04 '21 at 17:46

rhens · Accepted Answer · 2021-05-04T17:45:46.357

The visual appearance of a button in PDF can be fully customized, with text, graphics and images. So, the image data could be stored in a slightly different way in different PDF documents. But generally speaking, the form field's widget annotation will have an appearance stream, which will have the image data as an XObject in its Resources dictionary.

Creating a PDF with a button with image for testing:

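String fieldname = "Image1_af_image";
// the document needs a PdfWriter, otherwise adding the field fails with
// "There is no associate PdfWriter for making indirects"
PdfDocument pdfDoc = new PdfDocument(new PdfWriter("button_test.pdf")); // output name is only an example
PdfAcroForm form = PdfAcroForm.getAcroForm(pdfDoc, true);
PdfButtonFormField imagefield = PdfFormField.createButton(pdfDoc, new Rectangle(100, 100, 50, 50),
        PdfButtonFormField.FF_PUSH_BUTTON);
imagefield.setImage("button.png").setFieldName(fieldname); // path to any image file
form.addField(imagefield);
pdfDoc.close();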

Getting the image data from a button:

PdfAcroForm acroForm = PdfAcroForm.getAcroForm(pdfDoc, false);
PdfFormField imagefield = acroForm.getField(fieldname);
// get the appearance dictionary
PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
// get the xobject resources
PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
for (PdfName key : xObjDic.keySet()) {
    System.out.println(key);
    PdfStream s = xObjDic.getAsStream(key);
    // only process images
    if (PdfName.Image.equals(s.getAsName(PdfName.Subtype))) {
        PdfImageXObject pixo = new PdfImageXObject(s);
        byte[] imgbytes = pixo.getImageBytes();
        String ext = pixo.identifyImageFileExtension();
    
        // write the image to a file named after the resource key (minus the leading "/")
        try (FileOutputStream fos = new FileOutputStream(key.toString().substring(1) + "." + ext)) {
            fos.write(imgbytes);
        }
    }
}
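
The file-writing step above builds the file name straight from the resource key. Here is a small plain-Java sketch of that step in isolation (no iText needed), assuming the key text looks like "/Im0". The class and method names (ImageFileNames, toFileName, write) are made up for illustration, and the character whitelist is just a cautious guess at what is safe on common file systems:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class ImageFileNames {

    /** Builds a file name such as "Im0.png" from a key text such as "/Im0". */
    static String toFileName(String keyText, String extension) {
        // drop the leading "/" that a PDF name carries in its textual form
        String base = keyText.startsWith("/") ? keyText.substring(1) : keyText;
        // replace anything outside a conservative whitelist (assumption, see above)
        base = base.replaceAll("[^A-Za-z0-9._-]", "_");
        if (base.isEmpty()) {
            base = "image"; // fallback so the name is never just ".ext"
        }
        return base + "." + extension;
    }

    /** Writes the bytes into the given directory and returns the path written. */
    static Path write(Path directory, String keyText, String extension, byte[] data) throws IOException {
        Path target = directory.resolve(toFileName(keyText, extension));
        return Files.write(target, data);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("pdf-images");
        Path written = write(dir, "/Im0", "png", new byte[] {1, 2, 3});
        System.out.println(written.getFileName()); // prints Im0.png
    }
}
```

In the extraction loop you would pass key.toString(), the extension from identifyImageFileExtension() and the bytes from getImageBytes() to such a helper.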

You can use a PDF object viewer, such as iText RUPS or Adobe Acrobat's built-in "Browse Internal PDF Structure", to inspect the exact structure of your PDF document and find out where the image data is stored.

EDIT:

A more generic way of extracting the image data, in case it's in nested Form XObjects:

PdfAcroForm acroForm = PdfAcroForm.getAcroForm(pdfDoc, false);
PdfFormField imagefield = acroForm.getField(fieldname);
// get the appearance dictionary
PdfDictionary apDic = imagefield.getWidgets().get(0).getNormalAppearanceObject();
// get the xobject resources
PdfDictionary xObjDic = apDic.getAsDictionary(PdfName.Resources).getAsDictionary(PdfName.XObject);
extractImagesFromXObj(xObjDic);

public void extractImagesFromXObj(PdfDictionary xObjDic) throws IOException {
    for (PdfName key : xObjDic.keySet()) {
        System.out.println(key);
        PdfStream s = xObjDic.getAsStream(key);
        if (s == null) {
            continue; // entry is not a stream, nothing to extract
        }
        PdfName subType = s.getAsName(PdfName.Subtype);
        // only process images
        if (PdfName.Image.equals(subType)) {
            PdfImageXObject pixo = new PdfImageXObject(s);
            byte[] imgbytes = pixo.getImageBytes();
            String ext = pixo.identifyImageFileExtension();

            // write the image to file
            try (FileOutputStream fos = new FileOutputStream(key.toString().substring(1) + "." + ext)) {
                fos.write(imgbytes);
            }
        }
        // process nested XObject dictionaries recursively
        else if (PdfName.Form.equals(subType)) {
            // a form XObject may have no Resources, or Resources without an XObject entry
            PdfDictionary resources = s.getAsDictionary(PdfName.Resources);
            PdfDictionary nestedXObjDic = resources == null ? null : resources.getAsDictionary(PdfName.XObject);
            if (nestedXObjDic != null) {
                extractImagesFromXObj(nestedXObjDic);
            }
        }
    }
}
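
To look at the recursion on its own, here is a sketch of the same traversal on a plain-Java stand-in for the XObject tree, so it runs without iText. The Node class and collectImageNames are hypothetical names, not iText API. The visited set is my own addition and not part of the code above: it assumes a damaged file could contain form XObjects that reference each other, which would otherwise recurse forever.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class XObjectWalk {

    /** Stand-in for an XObject stream: a subtype plus its nested XObject resources. */
    static final class Node {
        final String subtype; // "Image" or "Form"
        final Map<String, Node> xObjects = new LinkedHashMap<>(); // models /Resources/XObject

        Node(String subtype) {
            this.subtype = subtype;
        }
    }

    /** Returns the keys of all image nodes reachable from the given XObject map. */
    static List<String> collectImageNames(Map<String, Node> xObjects) {
        List<String> names = new ArrayList<>();
        Set<Node> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(xObjects, visited, names);
        return names;
    }

    private static void walk(Map<String, Node> xObjects, Set<Node> visited, List<String> names) {
        if (xObjects == null) {
            return; // form without nested XObjects
        }
        for (Map.Entry<String, Node> entry : xObjects.entrySet()) {
            Node node = entry.getValue();
            if (node == null || !visited.add(node)) {
                continue; // skip missing entries and nodes already seen (cycle guard)
            }
            if ("Image".equals(node.subtype)) {
                names.add(entry.getKey());
            } else if ("Form".equals(node.subtype)) {
                walk(node.xObjects, visited, names); // descend into the nested form
            }
        }
    }

    public static void main(String[] args) {
        Node inner = new Node("Form");
        inner.xObjects.put("Im0", new Node("Image"));
        Node outer = new Node("Form");
        outer.xObjects.put("FRM", inner);
        inner.xObjects.put("Loop", outer); // deliberate cycle back to the outer form

        Map<String, Node> root = new LinkedHashMap<>();
        root.put("Fm0", outer);
        System.out.println(collectImageNames(root)); // prints [Im0]
    }
}
```

With iText the same guard could be applied by keeping a set of the PdfStream objects already visited, but I have not tested that against real files.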

The exact structure differs depending on the PDF creation software: the image XObjects might not be immediate resources of the normal appearance stream but of a nested form XObject instead. Thus, in general one also has to recurse into the form XObjects of the appearance stream and look for image XObjects there, too. — mkl, May 04 '21 at 14:47
Indeed, @mkl. I didn't want to complicate my initial answer with that more generic approach. But based on your comment and the edit that was made to the question, I have added some sample code to traverse the nested dictionaries. — rhens, May 04 '21 at 17:51
This is brilliant--works exactly as you described and is precisely what I needed--thanks! (I see that what I was missing was to iterate over xObjDic.keySet(). Always nice to learn a new trick!). — Stephen Schultz, May 05 '21 at 01:30
@StephenSchultz, if this solved your problem, please consider accepting the answer [1](https://stackoverflow.com/help/someone-answers) [2](https://stackoverflow.com/help/accepted-answer) [3](https://meta.stackexchange.com/questions/5234/) — rhens, May 05 '21 at 16:16
When I attempt to set the image programmatically using the code supplied in the initial answer to my post, I get this exception: com.itextpdf.kernel.PdfException: There is no associate PdfWriter for making indirects. at com.itextpdf.kernel.pdf.PdfObject.makeIndirect(PdfObject.java:229). I'm interpreting "button.png" in the code as a resource file name. I have tried a name relative to the application's path as well as an absolute path. — Stephen Schultz, May 05 '21 at 20:59
After posting this comment I realized that I had indeed failed to configure a reader for this document. That done, the code executes! — Stephen Schultz, May 06 '21 at 00:45
