When an image is placed on OverContent with PdfStamper how can it be found later?

Question

When a barcode image is placed on the pdf using stamper in this manner:

  PdfContentByte page = stamper.GetOverContent(i);
  image.SetAbsolutePosition(x, y);
  page.AddImage(image);

it displays properly when the PDF is rendered in a viewer, but it is not found by the code below (adapted from here). The code simply doesn't recognize it as existing. It does find an image that was placed in the PDF by Acrobat Pro XI, but not the one added in the above manner.

What is the proper way to place a barcode image on a PDF in iTextSharp such that the image will be included in the PdfDictionary? What needs to be changed: the code above, or the code below?

  for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
  {
      PdfDictionary pg = pdf.GetPageN(pageNumber);
      PdfObject obj = FindImageInPDFDictionary(pg);
      if (obj != null)
      {
          int XrefIndex = Convert.ToInt32(((PRIndirectReference)obj).Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
          PdfObject pdfObj = pdf.GetPdfObject(XrefIndex);
          PdfStream pdfStream = (PdfStream)pdfObj;
          byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)pdfStream);
          if (bytes != null)
          {
              using (System.IO.MemoryStream memStream = new System.IO.MemoryStream(bytes))
              {
                  memStream.Position = 0;
                  System.Drawing.Image img = System.Drawing.Image.FromStream(memStream);
                  // now we have an image and can examine it
                  // to see if it is a barcode
              }
          }
      }
  }

The `FindImageInPDFDictionary` method from the accepted answer to the question you refer to is deficient in many ways. You should use the iText parsing framework instead, cf. [This answer](http://stackoverflow.com/a/24239462/1729265) — mkl, Aug 20 '16 at 19:40
That being said, is that `image` you add really a bitmap image? After all, itext can wrap other stuff as an `Image`, too. — mkl, Aug 20 '16 at 20:35
@mkl: Thanks, I will take a look at that answer. The barcode is actually an iTextSharp.text.Image object. — Tim, Aug 20 '16 at 21:23
An `iTextSharp.text.Image` object can contain a lot of different things. Among them bitmap images but other entities, too. Thus, what does your `image` contain? — mkl, Aug 20 '16 at 23:07
You mention a bar code. That is most likely an image that consists of *vector data*, not *raster data*. Vector data is stored as a *form XObject* inside a PDF; although you use the `Image` class, it's not considered an image from the point of view of the PDF. Images in PDF are stored as *Image XObjects*. The parser framework in iText that extracts images from a PDF only looks for Image XObjects, not for Form XObjects. — Bruno Lowagie, Aug 21 '16 at 06:11
Success with `PdfReaderContentParser` and `MyImageRenderListener`. Thank you for the help and for iTextSharp :) — Tim, Aug 22 '16 at 09:44
@Tim Do you want to make that an answer yourself? Or should I do? — mkl, Aug 22 '16 at 11:56
@mkl: Please, you take the credit since you were the one who steered me in the right direction. :) — Tim, Aug 22 '16 at 12:56

score 1 · Accepted Answer · edited May 23 '17 at 12:00

First of all, an iText Image object is not necessarily a bitmap image; it can also be a wrapper of a form XObject containing, e.g., only vector graphics. The extraction code, on the other hand, only considers bitmap images.

In the case at hand, though, it turned out that the image indeed was a bitmap image.

There is nothing special in the way iText adds images to the OverContent; the problem is the FindImageInPDFDictionary method from the accepted answer to the question you refer to:

private static PdfObject FindImageInPDFDictionary(PdfDictionary pg) {
    PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));

    PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
    if (xobj != null) {
        foreach (PdfName name in xobj.Keys) {
            PdfObject obj = xobj.Get(name);
            if (obj.IsIndirect()) {
                PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);

                PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));

                //image at the root of the pdf
                if (PdfName.IMAGE.Equals(type)) {
                    return obj;
                }// image inside a form
                else if (PdfName.FORM.Equals(type)) {
                    return FindImageInPDFDictionary(tg);
                } //image inside a group
                else if (PdfName.GROUP.Equals(type)) {
                    return FindImageInPDFDictionary(tg);
                }
            }
        }
    }
    return null;
}

It is deficient in more than one way:

- It only considers the first Image, Form, or Group xobject from the resources of the pg dictionary as it immediately returns in any of these cases not caring whether the recursive call in any of the latter two cases returns a non-null result.
- Putting the issue above aside, it inspects the page resources and the resources of the contained form xobjects and groups and nothing else. Thus,
  - it doesn't check whether an image resource it found is actually used on the page, so it may return an image which is not at all present on the page,
  - it ignores inline images which are contained in the content stream, and
  - it ignores images contained in patterns or Type 3 fonts.
- It ignores whether the image found has a mask. Sometimes the mask contains the main information of the resulting image while the base image merely determines colors; in particular ink signature images often contain the path of the pen in the mask while the whole base image is filled with the ink color.
- It cannot return more than one image per page.
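
To make the first and last points concrete, here is a minimal sketch of a variant that keeps searching instead of returning on the first hit. The name CollectImages and its parameters are hypothetical and not part of iTextSharp; the sketch assumes iTextSharp 5.x. It still has every other deficiency listed above (unused resources, inline images, patterns, Type 3 fonts, masks), so it is an illustration only, not a recommendation:

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class XObjectWalker
{
    // Hypothetical helper: collects references to ALL image XObjects reachable
    // from the resources of 'container' (a page or a form XObject).
    public static void CollectImages(PdfDictionary container,
                                     List<PdfObject> found,
                                     HashSet<int> visited)
    {
        PdfDictionary res = container.GetAsDict(PdfName.RESOURCES);
        if (res == null) return;
        PdfDictionary xobj = res.GetAsDict(PdfName.XOBJECT);
        if (xobj == null) return;

        foreach (PdfName name in xobj.Keys)
        {
            PdfObject obj = xobj.Get(name);
            if (obj == null || !obj.IsIndirect()) continue;

            // Skip objects already seen; this also guards against
            // form XObjects that (indirectly) reference themselves.
            int number = ((PdfIndirectReference)obj).Number;
            if (!visited.Add(number)) continue;

            PdfDictionary tg = PdfReader.GetPdfObject(obj) as PdfDictionary;
            if (tg == null) continue;

            PdfName type = tg.GetAsName(PdfName.SUBTYPE);
            if (PdfName.IMAGE.Equals(type))
                found.Add(obj);                      // keep going, do not return
            else if (PdfName.FORM.Equals(type))
                CollectImages(tg, found, visited);   // descend, then continue the loop
        }
    }
}
```

Called with a page dictionary, an empty list, and an empty set, this fills the list with every image reference it can reach rather than at most one.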

Furthermore, if it is used as in that answer

PdfDictionary pg = pdf.GetPageN(pageNumber);

// recursively search pages, forms and groups for images.
PdfObject obj = FindImageInPDFDictionary(pg);

then only resources immediately associated with the page object are inspected, but resources can alternatively also be inherited from an ancestor node in the page tree.

You should use the iText parsing framework instead, cf. e.g. the answer to "Extract Images from PDF coordinates using iText" or variations thereof (there is a MyImageRenderListener class referenced very often). In particular

- it returns all of its finds via callback, not merely a single one per page;
- it doesn't ignore some of the images it is made to consider;
- it scans the content stream and, therefore, finds inline images and only those resources which are actually used;
- it returns the mask of an image if applicable;
- as a bonus it returns position and transformation of the image use.
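
As a rough sketch of this approach, a listener for iTextSharp 5.x might look as follows. This is not the MyImageRenderListener class mentioned above, only a hypothetical stand-in: the names ImageCollector, FoundImage, and ImageFinder, and the choice of fields to record, are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical listener: records every bitmap image the parser reports.
public class ImageCollector : IRenderListener
{
    public class FoundImage
    {
        public byte[] Bytes;   // image data as provided by iText
        public float X, Y;     // translation part of the image's transformation matrix
        public bool HasMask;   // true if the image dictionary has a /Mask or /SMask entry
        public bool IsInline;  // true if the image has no indirect reference
    }

    public readonly List<FoundImage> Images = new List<FoundImage>();

    // Called once for each image use found in the content stream.
    public void RenderImage(ImageRenderInfo renderInfo)
    {
        // May throw for image formats iText cannot decode.
        PdfImageObject image = renderInfo.GetImage();
        if (image == null) return;

        Matrix ctm = renderInfo.GetImageCTM();
        PdfDictionary dict = image.GetDictionary();

        Images.Add(new FoundImage
        {
            Bytes = image.GetImageAsBytes(),
            X = ctm[Matrix.I31],
            Y = ctm[Matrix.I32],
            HasMask = dict.Contains(PdfName.MASK) || dict.Contains(PdfName.SMASK),
            IsInline = renderInfo.GetRef() == null
        });
    }

    // Text events are not needed for image extraction.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }
}

public static class ImageFinder
{
    // Parses every page of the PDF at 'path' and returns all reported images.
    public static List<ImageCollector.FoundImage> FindImages(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            ImageCollector collector = new ImageCollector();
            for (int page = 1; page <= reader.NumberOfPages; page++)
                parser.ProcessContent(page, collector);
            return collector.Images;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Each entry's Bytes can then be loaded with System.Drawing.Image.FromStream, as in the question, to check whether the image is a barcode. A barcode added as vector data (a form XObject) will not be reported through RenderImage.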

It's not perfect, though: in particular, it does not scan patterns and Type 3 fonts for images (but the parsing framework lets you try to extract Type 3 font uses as text), and it does not look at inherited resources either.
