I'm trying to extract images from PDF files using iTextSharp.
The process works for most of the PDF files I have, but it fails for some others.
In particular, the failing PDFs contain images that use the filters /ASCIIHexDecode and /CCITTFaxDecode.
How can I decode images with these filters?
FYI, my image extraction routine is below (the pg object is obtained using PdfReader.GetPageN):
private static void FindImages(PdfReader reader, PdfDictionary pdfPage)
{
    var imgPdfObject = FindImageInPDFDictionary(pdfPage);
    foreach (var image in imgPdfObject)
    {
        // Resolve the indirect reference to the actual image stream
        var xrefIndex = ((PRIndirectReference)image).Number;
        var stream = reader.GetPdfObject(xrefIndex);
        // Exception occurs here:
        var pdfImage = new PdfImageObject((PRStream)stream);
        var img = (Bitmap)pdfImage.GetDrawingImage();
        // Do something with the image
    }
}
private static IEnumerable<PdfObject> FindImageInPDFDictionary(PdfDictionary pg)
{
    PdfDictionary res =
        (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
    // A page or form without a /Resources entry has nothing to scan
    if (res == null)
    {
        yield break;
    }
    PdfDictionary xobj =
        (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
    if (xobj != null)
    {
        foreach (PdfName name in xobj.Keys)
        {
            PdfObject obj = xobj.Get(name);
            if (obj.IsIndirect())
            {
                PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
                PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
                // Image directly in the page resources
                if (PdfName.IMAGE.Equals(type))
                {
                    yield return obj;
                }
                // Image nested inside a form or a group: recurse
                else if (PdfName.FORM.Equals(type) || PdfName.GROUP.Equals(type))
                {
                    foreach (var nestedObj in FindImageInPDFDictionary(tg))
                    {
                        yield return nestedObj;
                    }
                }
            }
        }
    }
}
The exact exception is:
iTextSharp.text.exceptions.InvalidImageException: **Invalid code encountered while decoding 2D group 4 compressed data.**
   at iTextSharp.text.pdf.codec.TIFFFaxDecoder.DecodeT6(Byte[] buffer, Byte[] compData, Int32 startX, Int32 height, Int64 tiffT6Options)
   at iTextSharp.text.pdf.FilterHandlers.Filter_CCITTFAXDECODE.Decode(Byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary)
   at iTextSharp.text.pdf.PdfReader.DecodeBytes(Byte[] b, PdfDictionary streamDictionary, IDictionary`2 filterHandlers)
   at iTextSharp.text.pdf.parser.PdfImageObject..ctor(PdfDictionary dictionary, Byte[] samples, PdfDictionary colorSpaceDic)
   at iTextSharp.text.pdf.parser.PdfImageObject..ctor(PRStream stream)
   at MyProject.MyClass.MyMethod(PdfReader reader, PdfDictionary pdfPage) in c:\\sopmewhere\\PdfProcessor.cs:line 161
FYI: here is a sample PDF that is causing trouble: test.pdf
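For reference, this is my understanding of what the /ASCIIHexDecode stage does on its own, written as a minimal standalone sketch that follows the filter's description in the PDF specification: pairs of hex digits become bytes, whitespace is ignored, `>` marks end of data, and an odd trailing digit is treated as if followed by 0. The class name `AsciiHexSketch` and its `Decode` method are hypothetical names of mine, not part of iTextSharp. Since the stack trace ends in `TIFFFaxDecoder.DecodeT6`, the failure seems to be in the /CCITTFaxDecode stage rather than in this one, so this only illustrates the first filter.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not an iTextSharp API: decodes /ASCIIHexDecode data.
public static class AsciiHexSketch
{
    public static byte[] Decode(byte[] encoded)
    {
        var output = new List<byte>();
        int pending = -1; // high nibble waiting for its partner, or -1 if none

        foreach (byte b in encoded)
        {
            char c = (char)b;
            if (c == '>')
            {
                break; // end-of-data marker
            }
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n' || c == '\f' || c == '\0')
            {
                continue; // whitespace between digits is ignored
            }

            int nibble = HexValue(c);
            if (nibble < 0)
            {
                throw new FormatException("Unexpected character in ASCIIHex data: " + c);
            }

            if (pending < 0)
            {
                pending = nibble;
            }
            else
            {
                output.Add((byte)((pending << 4) | nibble));
                pending = -1;
            }
        }

        // Odd number of digits: behave as if a 0 followed the last digit
        if (pending >= 0)
        {
            output.Add((byte)(pending << 4));
        }
        return output.ToArray();
    }

    // Returns the numeric value of a hex digit, or -1 if c is not one.
    private static int HexValue(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }
}
```

For example, calling `AsciiHexSketch.Decode` on the ASCII bytes of `48 65 6C6c 6F>` yields the bytes of `Hello`. In my failing files the output of this stage would still be CCITT Group 4 data, which is the part I don't know how to decode.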