I'm trying to extract images from PDF files using iTextSharp.

The process works for most of my PDF files but fails for some others.

In particular, the failing PDFs contain images with the /ASCIIHexDecode and /CCITTFaxDecode filters.

How can I decode images that use these filters?

FYI, my image extraction routine is below (the page dictionary is obtained using PdfReader.GetPageN):

private static void FindImages(PdfReader reader, PdfDictionary pdfPage)
{
    var imgPdfObject = FindImageInPDFDictionary(pdfPage);
    foreach (var image in imgPdfObject)
    {
        // Resolve the indirect reference to the image's stream object.
        var xrefIndex = ((PRIndirectReference)image).Number;
        var stream = reader.GetPdfObject(xrefIndex);
        // Exception occurs here:
        var pdfImage = new PdfImageObject((PRStream)stream);
        var img = (Bitmap)pdfImage.GetDrawingImage();

        // Do something with the image

    }
}
private static IEnumerable<PdfObject> FindImageInPDFDictionary(PdfDictionary pg)
{
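    PdfDictionary res =
        (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
    // A page or form without a /Resources dictionary has no images to report.
    if (res == null)
        yield break;

    PdfDictionary xobj =
      (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));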
    if (xobj != null)
    {
        foreach (PdfName name in xobj.Keys)
        {
            PdfObject obj = xobj.Get(name);
            if (obj.IsIndirect())
            {
                PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);

                PdfName type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));

                //image at the root of the pdf
                if (PdfName.IMAGE.Equals(type))
                {
                    yield return obj;
                }// image inside a form
                else if (PdfName.FORM.Equals(type))
                {
                    foreach (var nestedObj in FindImageInPDFDictionary(tg))
                    {
                        yield return nestedObj;
                    }
                } //image inside a group
                else if (PdfName.GROUP.Equals(type))
                {
                    foreach (var nestedObj in FindImageInPDFDictionary(tg))
                    {
                        yield return nestedObj;
                    }
                }
            }
        }
    }
}

The exact exception is:

iTextSharp.text.exceptions.InvalidImageException: **Invalid code encountered while decoding 2D group 4 compressed data.**
  at iTextSharp.text.pdf.codec.TIFFFaxDecoder.DecodeT6(Byte[] buffer, Byte[] compData, Int32 startX, Int32 height, Int64 tiffT6Options)
  at iTextSharp.text.pdf.FilterHandlers.Filter_CCITTFAXDECODE.Decode(Byte[] b, PdfName filterName, PdfObject decodeParams, PdfDictionary streamDictionary)
  at iTextSharp.text.pdf.PdfReader.DecodeBytes(Byte[] b, PdfDictionary streamDictionary, IDictionary`2 filterHandlers)
  at iTextSharp.text.pdf.parser.PdfImageObject..ctor(PdfDictionary dictionary, Byte[] samples, PdfDictionary colorSpaceDic)
  at iTextSharp.text.pdf.parser.PdfImageObject..ctor(PRStream stream)
  at MyProject.MyClass.MyMethod(PdfReader reader, PdfDictionary pdfPage) in c:\\sopmewhere\\PdfProcessor.cs:line 161

FYI: here is a sample PDF that is causing trouble: test.pdf

Steve B
  • Please share the pdf for which the exception occurs. – mkl Nov 03 '17 at 20:21
  • I've updated the repro code that was missing actual failing code, and a sample PDF that is causing trouble. – Steve B Nov 06 '17 at 08:49
  • Your sample file does not contain any stream with **ASCIIHexDecode** filter. You might want to edit this out of your question title as it can mislead people to concentrate on that filter, like Jacek Blaszczynski in his answer. – mkl Nov 06 '17 at 10:04
  • I could reproduce your issue, indeed iText does not recognize the image in question as valid. Unfortunately I'm not that deep into the image formats to tell whether the image data indeed are broken or the iText image decoding code is incomplete. As the format is a TIFF variant, though, both may be true. – mkl Nov 06 '17 at 10:58
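
Since the comment above notes that the image format is a TIFF variant, one way to check whether the data or the decoder is at fault is to wrap the raw /CCITTFaxDecode bytes in a minimal TIFF container and open the result in another TIFF reader. The following is only a sketch under stated assumptions: the class and method names (`CcittTiffWrapper`, `WrapGroup4InTiff`) are hypothetical, it handles Group 4 data only (a negative /K in the stream's /DecodeParms), it assumes a single strip at 1 bit per pixel, and it always writes WhiteIsZero, so an image with /BlackIs1 true may come out inverted.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of iTextSharp: prepends a minimal
// little-endian TIFF header to raw CCITT Group 4 (T.6) data.
public static class CcittTiffWrapper
{
    private const ushort EntryCount = 8;
    // 8-byte file header + 2-byte entry count + 12 bytes per entry + 4-byte next-IFD offset.
    private const uint HeaderSize = 8 + 2 + EntryCount * 12 + 4;

    public static byte[] WrapGroup4InTiff(byte[] ccittData, int width, int height)
    {
        if (ccittData == null) throw new ArgumentNullException("ccittData");
        if (width <= 0) throw new ArgumentOutOfRangeException("width");
        if (height <= 0) throw new ArgumentOutOfRangeException("height");

        using (var ms = new MemoryStream())
        using (var w = new BinaryWriter(ms)) // BinaryWriter writes little-endian
        {
            w.Write((byte)'I'); w.Write((byte)'I');        // byte order: little-endian
            w.Write((ushort)42);                           // TIFF magic number
            w.Write((uint)8);                              // offset of the first IFD
            w.Write(EntryCount);

            // Entries must be sorted by tag. Type 3 = SHORT, type 4 = LONG.
            WriteEntry(w, 256, 4, (uint)width);            // ImageWidth
            WriteEntry(w, 257, 4, (uint)height);           // ImageLength
            WriteEntry(w, 258, 3, 1);                      // BitsPerSample
            WriteEntry(w, 259, 3, 4);                      // Compression: 4 = CCITT Group 4
            WriteEntry(w, 262, 3, 0);                      // Photometric: 0 = WhiteIsZero (assumption)
            WriteEntry(w, 273, 4, HeaderSize);             // StripOffsets
            WriteEntry(w, 278, 4, (uint)height);           // RowsPerStrip
            WriteEntry(w, 279, 4, (uint)ccittData.Length); // StripByteCounts

            w.Write((uint)0);                              // no further IFDs
            w.Write(ccittData);
            w.Flush();
            return ms.ToArray();
        }
    }

    private static void WriteEntry(BinaryWriter w, ushort tag, ushort type, uint value)
    {
        w.Write(tag);
        w.Write(type);
        w.Write((uint)1); // one value per entry
        w.Write(value);   // a SHORT value sits in the low-order bytes when little-endian
    }
}
```

In iTextSharp 5 the undecoded bytes should be obtainable with `PdfReader.GetStreamBytesRaw((PRStream)stream)`, and the dimensions from the stream dictionary's /Width and /Height entries. If the stream lists more than one filter (for example /ASCIIHexDecode followed by /CCITTFaxDecode), the earlier filters would have to be undone before wrapping. If another TIFF reader also rejects the wrapped file, the image data itself is the more likely culprit.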

1 Answer

Without going very deep into your code sample: there are alternative implementations of the PDF filters, and a very simple one is PDFsharp's AsciiHexDecode.cs, reproduced below. Replacing the decoders implemented in iTextSharp should be straightforward and would let you verify whether the data is corrupted or one of the decoders has a bug. Unfortunately, I had no /CCITTFaxDecode example at hand at the time of writing.

#region PDFsharp - A .NET library for processing PDF
//
// Copyright (c) 2005-2016 empira Software GmbH, Cologne Area (Germany)
//
// http://www.pdfsharp.com
// http://sourceforge.net/projects/pdfsharp
//
// Permission is hereby granted, free of charge, to any person obtaining a
// copy of this software and associated documentation files (the "Software"),
// to deal in the Software without restriction, including without limitation
// the rights to use, copy, modify, merge, publish, distribute, sublicense,
// and/or sell copies of the Software, and to permit persons to whom the
// Software is furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included
// in all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
// THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER 
// DEALINGS IN THE SOFTWARE.
#endregion

using System;

namespace PdfSharp.Pdf.Filters
{
    /// <summary>
    /// Implements the ASCIIHexDecode filter.
    /// </summary>
    public class AsciiHexDecode : Filter
    {
        // Reference: 3.3.1  ASCIIHexDecode Filter / Page 69

        /// <summary>
        /// Encodes the specified data.
        /// </summary>
        public override byte[] Encode(byte[] data)
        {
            if (data == null)
                throw new ArgumentNullException("data");

            int count = data.Length;
            byte[] bytes = new byte[2 * count];
            for (int i = 0, j = 0; i < count; i++)
            {
                byte b = data[i];
                bytes[j++] = (byte)((b >> 4) + ((b >> 4) < 10 ? (byte)'0' : (byte)('A' - 10)));
                bytes[j++] = (byte)((b & 0xF) + ((b & 0xF) < 10 ? (byte)'0' : (byte)('A' - 10)));
            }
            return bytes;
        }

        /// <summary>
        /// Decodes the specified data.
        /// </summary>
        public override byte[] Decode(byte[] data, FilterParms parms)
        {
            if (data == null)
                throw new ArgumentNullException("data");

            data = RemoveWhiteSpace(data);
            int count = data.Length;
            // Ignore EOD (end of data) character.
            // EOD can be anywhere in the stream, but makes sense only at the end of the stream.
            if (count > 0 && data[count - 1] == '>')
                --count;
            if (count % 2 == 1)
            {
                count++;
                byte[] temp = data;
                data = new byte[count];
                temp.CopyTo(data, 0);
            }
            count >>= 1;
            byte[] bytes = new byte[count];
            for (int i = 0, j = 0; i < count; i++)
            {
                // Must support 0-9, A-F, a-f - "Any other characters cause an error."
                byte hi = data[j++];
                byte lo = data[j++];
                if (hi >= 'a' && hi <= 'f')
                    hi -= 32;
                if (lo >= 'a' && lo <= 'f')
                    lo -= 32;
                // TODO Throw on invalid characters. Stop when encountering EOD. Add one more byte if EOD is the lo byte.
                bytes[i] = (byte)((hi > '9' ? hi - '7'/*'A' + 10*/: hi - '0') * 16 + (lo > '9' ? lo - '7'/*'A' + 10*/: lo - '0'));
            }
            return bytes;
        }
    }
}
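
The TODO in `Decode` above lists three things the PDFsharp code does not do: reject invalid characters, stop at the EOD marker, and pad an odd final digit. A minimal standalone sketch of those rules might look like the following. The class name `StrictAsciiHex` is hypothetical and the code is not taken from PDFsharp or iTextSharp. It assumes the PDF rules that `>` ends the data, that white space is ignored, and that a missing final digit is treated as 0.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical standalone decoder, shown only to illustrate the TODO items.
public static class StrictAsciiHex
{
    public static byte[] Decode(byte[] data)
    {
        if (data == null) throw new ArgumentNullException("data");

        var result = new List<byte>(data.Length / 2);
        int high = -1; // pending high nibble, or -1 when none is pending

        foreach (byte b in data)
        {
            if (b == (byte)'>') break;     // EOD marker: stop decoding here
            if (IsWhiteSpace(b)) continue; // white space between digits is ignored

            int nibble = HexValue(b);
            if (nibble < 0)
                throw new FormatException(
                    "Invalid character in ASCIIHexDecode data: 0x" + b.ToString("X2"));

            if (high < 0)
            {
                high = nibble;
            }
            else
            {
                result.Add((byte)((high << 4) | nibble));
                high = -1;
            }
        }

        // Odd number of digits: behave as if a 0 followed the last digit.
        if (high >= 0)
            result.Add((byte)(high << 4));

        return result.ToArray();
    }

    // Returns 0..15 for a hex digit, or -1 for any other character.
    private static int HexValue(byte b)
    {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        return -1;
    }

    // PDF white-space characters: NUL, HT, LF, FF, CR, SP.
    private static bool IsWhiteSpace(byte b)
    {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }
}
```

For example, decoding the ASCII bytes of `48 65 6C 6c 6F>` yields the five bytes of "Hello", and `7>` yields the single byte 0x70.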
Jacek Blaszczynski
  • It is highly improbable that the issue is due to a problem in the AsciiHexDecoder of iText. If the issue is due to a problem in iText at all (and not broken image data), it more likely would be a limitation of the `TIFFFaxDecoder`. Have you, before answering, checked that class at all? – mkl Nov 04 '17 at 08:42
  • Thanks @Jacek, but I don't think my issue is related to the filter itself, but to the logic in my code. I was expecting that the method `PdfImage.GetDrawingImage` would handle that, as I've read many times in blog posts. – Steve B Nov 06 '17 at 08:45