I extract the images found in a PDF document with iTextSharp using this snippet (thanks to @Scott Stanford from this topic):
    using System.Collections.Generic;
    using iTextSharp.text.pdf;

    private static IList<System.Drawing.Image> GetImagesFromPdfDict(PdfDictionary dict, PdfReader doc)
    {
        // Fully qualified to avoid ambiguity with iTextSharp.text.Image.
        List<System.Drawing.Image> images = new List<System.Drawing.Image>();
        if (dict == null)
            return images;

        PdfDictionary res = (PdfDictionary)(PdfReader.GetPdfObject(dict.Get(PdfName.RESOURCES)));
        if (res == null)
            return images;

        PdfDictionary xobj = (PdfDictionary)(PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)));
        if (xobj == null)
            return images;

        foreach (PdfName name in xobj.Keys)
        {
            PdfObject obj = xobj.Get(name);
            if (obj.IsIndirect())
            {
                PdfDictionary tg = (PdfDictionary)(PdfReader.GetPdfObject(obj));
                PdfName subtype = (PdfName)(PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE)));
                if (PdfName.IMAGE.Equals(subtype))
                {
                    // Image XObject: decode its stream into a System.Drawing.Image.
                    int xrefIdx = ((PRIndirectReference)obj).Number;
                    PdfObject pdfObj = doc.GetPdfObject(xrefIdx);
                    PdfStream str = (PdfStream)(pdfObj);
                    iTextSharp.text.pdf.parser.PdfImageObject pdfImage =
                        new iTextSharp.text.pdf.parser.PdfImageObject((PRStream)str);
                    System.Drawing.Image img = pdfImage.GetDrawingImage();
                    images.Add(img);
                }
                else if (PdfName.FORM.Equals(subtype) || PdfName.GROUP.Equals(subtype))
                {
                    // Form XObjects and groups can hold nested images, so recurse.
                    images.AddRange(GetImagesFromPdfDict(tg, doc));
                }
            }
        }
        return images;
    }
Then I save each extracted System.Drawing.Image as a JPEG file like this:
    image.Save(path, ImageFormat.Jpeg);
This works well for most pictures, but in some rare cases the saved pictures look like this:
(I added the black stroke after generating the image, because these pictures show real people.)
White turns into pink, and black areas turn into shades of green.
I tried saving the System.Drawing.Image with several encodings (using System.Drawing.Imaging.EncoderParameter, and also as PNG), but the output did not change. So I think the problem comes from extracting the image from the PDF and creating the System.Drawing.Image, not from saving it.
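One thing that could be checked at this point (this is only an assumption, not something confirmed for these files) is whether the affected images are four-component (CMYK/YCCK) JPEGs carrying an Adobe APP14 marker, a combination that System.Drawing is often reported to decode with wrong colors. Below is a minimal sketch in plain C#. JpegProbe and JpegInfo are hypothetical names introduced for illustration, and the input is assumed to be the raw JPEG bytes of one embedded image (for example, what pdfImage.GetImageAsBytes() returns when the stream uses the DCTDecode filter).

```csharp
using System;

// Hypothetical helper types, not part of iTextSharp.
public sealed class JpegInfo
{
    public int Components;       // 1 = gray, 3 = RGB/YCbCr, 4 = CMYK/YCCK, 0 = no SOF found
    public bool HasAdobeMarker;  // true if an APP14 "Adobe" segment was seen
}

public static class JpegProbe
{
    // Walks the JPEG header segments; returns null if the data is not a JPEG.
    public static JpegInfo Inspect(byte[] data)
    {
        if (data == null || data.Length < 4 || data[0] != 0xFF || data[1] != 0xD8)
            return null; // missing SOI marker

        JpegInfo info = new JpegInfo();
        int pos = 2;
        while (pos + 4 <= data.Length && data[pos] == 0xFF)
        {
            byte marker = data[pos + 1];
            if (marker == 0xFF) { pos++; continue; } // skip fill byte
            if (marker == 0xDA) break;               // start of scan: header is over

            // Segment length is big-endian and includes its own two bytes.
            int length = (data[pos + 2] << 8) | data[pos + 3];

            // APP14 segment whose payload starts with "Adobe".
            if (marker == 0xEE && pos + 9 <= data.Length
                && data[pos + 4] == 'A' && data[pos + 5] == 'd'
                && data[pos + 6] == 'o' && data[pos + 7] == 'b'
                && data[pos + 8] == 'e')
            {
                info.HasAdobeMarker = true;
            }

            // SOF0..SOF15, excluding DHT (C4), JPG (C8) and DAC (CC).
            bool isSof = marker >= 0xC0 && marker <= 0xCF
                && marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
            if (isSof)
            {
                // Layout: marker(2) length(2) precision(1) height(2) width(2) components(1)
                if (pos + 9 < data.Length)
                    info.Components = data[pos + 9];
                break;
            }

            pos += 2 + length;
        }
        return info;
    }
}
```

If Inspect reports Components == 4 for exactly the images that come out pink and green, that would point at the color space handling during decoding rather than at the save step.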
To check that the pictures are not corrupted, I tried the online PDF extractor http://www.extractpdf.com/. This tool extracted these pictures without any problem.
Does anybody have an idea how to solve this issue?