I extract the images from a PDF document with iTextSharp using this snippet (thanks to @Scott Stanford, from this topic):

    // Recursively collects every image XObject reachable from the given dictionary
    // (typically a page dictionary).
    private static IList<System.Drawing.Image> GetImagesFromPdfDict(PdfDictionary dict, PdfReader doc)
    {
        // Fully qualified so the element type matches the return type and cannot be
        // confused with iTextSharp.text.Image.
        List<System.Drawing.Image> images = new List<System.Drawing.Image>();

        if (dict == null)
            return images;

        PdfDictionary res = (PdfDictionary)(PdfReader.GetPdfObject(dict.Get(PdfName.RESOURCES)));

        if (res == null)
            return images;

        PdfDictionary xobj = (PdfDictionary)(PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)));

        if (xobj == null)
            return images;

        foreach (PdfName name in xobj.Keys)
        {
            PdfObject obj = xobj.Get(name);
            if (obj.IsIndirect())
            {
                PdfDictionary tg = (PdfDictionary)(PdfReader.GetPdfObject(obj));
                PdfName subtype = (PdfName)(PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE)));
                if (PdfName.IMAGE.Equals(subtype))
                {
                    // Look up the image stream by its object number.
                    int xrefIdx = ((PRIndirectReference)obj).Number;
                    PdfObject pdfObj = doc.GetPdfObject(xrefIdx);
                    PdfStream str = (PdfStream)(pdfObj);

                    iTextSharp.text.pdf.parser.PdfImageObject pdfImage =
                        new iTextSharp.text.pdf.parser.PdfImageObject((PRStream)str);

                    // Decode the stream into a GDI+ image.
                    System.Drawing.Image img = pdfImage.GetDrawingImage();

                    images.Add(img);
                }
                else if (PdfName.FORM.Equals(subtype) || PdfName.GROUP.Equals(subtype))
                {
                    // Form XObjects and groups can carry their own resources, so recurse.
                    images.AddRange(GetImagesFromPdfDict(tg, doc));
                }
            }
        }

        return images;
    }

Then I save each extracted System.Drawing.Image to a JPEG file like this:

    image.Save(path, ImageFormat.Jpeg);

This works well for most pictures, but in some rare cases the saved pictures come out with wrong colors: [images: people1, people2]

(I added the black bars after generating the images because these pictures show real people.)

White turns pink, and blacks turn into shades of green.

I tried saving the System.Drawing.Image with several encodings (System.Drawing.Imaging.EncoderParameter, and also as PNG...), but I did not manage to change the output. So I think the problem comes from extracting the image from the PDF and creating the System.Drawing.Image.

To check that the pictures are not corrupted, I tried the online PDF extractor http://www.extractpdf.com/. This tool extracted them without any problem.
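
A further check along the same lines would be to bypass System.Drawing entirely. The sketch below has not been run against the problematic PDF. `ImageDiagnostics`, `DumpImage`, `outputDir` and `objectNumber` are hypothetical names introduced here for illustration. It assumes the `PRStream` obtained in the extraction loop above, prints a few entries of the image dictionary, and writes the bytes iTextSharp decodes straight to disk.

```csharp
using System;
using iTextSharp.text.pdf;

internal static class ImageDiagnostics
{
    // Hypothetical helper: shows how the image is stored in the PDF and saves
    // the decoded bytes without going through System.Drawing.Image.
    public static void DumpImage(PRStream stream, string outputDir, int objectNumber)
    {
        // Resolve indirect references so the actual values are printed.
        PdfObject filter = PdfReader.GetPdfObject(stream.Get(PdfName.FILTER));
        PdfObject colorSpace = PdfReader.GetPdfObject(stream.Get(PdfName.COLORSPACE));
        PdfObject bits = PdfReader.GetPdfObject(stream.Get(PdfName.BITSPERCOMPONENT));

        Console.WriteLine("Object " + objectNumber + ":");
        Console.WriteLine("  Filter:           " + filter);
        Console.WriteLine("  ColorSpace:       " + colorSpace);
        Console.WriteLine("  BitsPerComponent: " + bits);

        iTextSharp.text.pdf.parser.PdfImageObject pdfImage =
            new iTextSharp.text.pdf.parser.PdfImageObject(stream);

        // GetFileType() returns the extension iTextSharp suggests for the bytes.
        string path = System.IO.Path.Combine(
            outputDir, "img_" + objectNumber + "." + pdfImage.GetFileType());

        System.IO.File.WriteAllBytes(path, pdfImage.GetImageAsBytes());
    }
}
```

It could be called as `ImageDiagnostics.DumpImage((PRStream)str, outputDir, xrefIdx);` right before `GetDrawingImage()`. If the raw file shows correct colors in another viewer, the problem would seem to lie in the conversion to System.Drawing.Image; if not, the printed entries may show how these images differ from the ones that extract correctly.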

Does anybody have an idea how to solve this issue?

bviale
  • You'll have to show us the PDF. – Paulo Soares Feb 09 '16 at 15:09
  • @PauloSoares Unfortunately I can't upload the PDF as is, since it contains confidential data; I'm looking for leads I can try on my own. I did try modifying the PDF to remove the sensitive information so I could share it, but the image extracted correctly from the resulting PDF... – bviale Feb 09 '16 at 15:51
