Extract FlateDecode images using iTextSharp

Question

I want to extract images from a PDF using iTextSharp. Some images are extracted correctly, but most of them have the wrong colors and are distorted. I experimented with different PixelFormats, but I did not find a solution to my problem.

This is the code that distinguishes the image types:

if (filter == "/FlateDecode")
{
   // ...
   int w = int.Parse(width);
   int h = int.Parse(height);
   int bpp = tg.GetAsNumber(PdfName.BITSPERCOMPONENT).IntValue;

   byte[] rawBytes = PdfReader.GetStreamBytesRaw((PRStream)tg);
   byte[] decodedBytes = PdfReader.FlateDecode(rawBytes);
   byte[] streamBytes = PdfReader.DecodePredictor(decodedBytes, tg.GetAsDict(PdfName.DECODEPARMS));

   PixelFormat[] pixFormats = new PixelFormat[23] { 
         PixelFormat.Format24bppRgb,
         // ... all Pixel Formats
    };
    for (int i = 0; i < pixFormats.Length; i++)
    {
        Program.ToPixelFormat(w, h, pixFormats[i], streamBytes, bpp, images);
    }
}

This is the code that saves the image to a MemoryStream. Saving the image to a folder will be implemented later.

private static void ToPixelFormat(int width, int height, PixelFormat pixelformat, byte[] bytes, int bpp, IList<Image> images)
{
    Bitmap bmp = new Bitmap(width, height, pixelformat);
    BitmapData bmd = bmp.LockBits(new Rectangle(0, 0, width, height),
       ImageLockMode.WriteOnly, pixelformat);
    Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length);
    bmp.UnlockBits(bmd);
    using (var ms = new MemoryStream())
    {
       bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Tiff);
       bytes = ms.GetBuffer();
    }
    images.Add(bmp);
}

Please help me.

Check out this response using some new features in 5.1.3 and greater: http://stackoverflow.com/a/8511314/231316 — Chris Haas, Apr 05 '12 at 14:27
You're right, that solution might work (the first of the examples), but the colors are still inverted or distorted. Thanks for your reply. — der_chirurg, Apr 05 '12 at 14:45

score 3 · Answer 1 · edited Dec 13 '12 at 22:15

Even though you found a solution to your problem, let me suggest a fix for your code above.

I believe the distortion is caused by a mismatch in row data boundaries. PdfReader returns image data with each row padded to a byte boundary. For example, for an 8-bit grayscale image 21 pixels wide you get 21 bytes of data per image row. The Bitmap class pads each row to a 32-bit (4-byte) boundary, so a grayscale bitmap 21 pixels wide has a stride (byte width) of 24 bytes. This means you cannot simply copy the bytes retrieved from PdfReader into a new bitmap with a single Marshal.Copy() call, as your ToPixelFormat() does.

The first pixel of the second row is the 22nd byte of the source array, but the destination Bitmap expects it at the 25th byte because of its 32-bit row boundary. To solve this, I created a byte array sized to account for the 32-bit boundary of each data row.

I then copied the data row by row from the byte array retrieved from PdfReader into the new array. The resulting bytes match the Bitmap class's row boundary, so they can be copied into the new Bitmap with Marshal.Copy().
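
The row-by-row copy described above could be sketched roughly as follows. This is a minimal illustration, not the answerer's actual code: StrideHelper, PackedRowBytes, AlignedStride and PadRows are hypothetical names, and the sketch assumes the decoded stream holds width * bitsPerPixel bits per row, rounded up to a whole byte.

```csharp
using System;

// Hypothetical helper (not part of iTextSharp or System.Drawing).
public static class StrideHelper
{
    // Bytes per row in the decoded PDF stream: rows are padded to a whole byte.
    public static int PackedRowBytes(int width, int bitsPerPixel)
    {
        return (width * bitsPerPixel + 7) / 8;
    }

    // Bytes per row when each row is padded to a 4-byte (32-bit) boundary.
    public static int AlignedStride(int width, int bitsPerPixel)
    {
        return ((width * bitsPerPixel + 31) / 32) * 4;
    }

    // Copies each source row to the start of a 4-byte-aligned destination row.
    // The padding bytes at the end of each row are left as zero.
    public static byte[] PadRows(byte[] source, int width, int height, int bitsPerPixel)
    {
        int srcRow = PackedRowBytes(width, bitsPerPixel);
        int dstRow = AlignedStride(width, bitsPerPixel);
        if (source == null || source.Length < srcRow * height)
            throw new ArgumentException("source is shorter than the image dimensions imply");

        byte[] padded = new byte[dstRow * height];
        for (int y = 0; y < height; y++)
        {
            Buffer.BlockCopy(source, y * srcRow, padded, y * dstRow, srcRow);
        }
        return padded;
    }
}
```

In ToPixelFormat(), the padded array would then be passed to Marshal.Copy() in place of the original bytes. It is probably safer to compare the computed stride with bmd.Stride before copying. Note also that Format24bppRgb stores each pixel as blue, green, red in memory, whereas PDF DeviceRGB samples are ordered red, green, blue, so swapped channels may be a separate cause of wrong colors.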

score 2 · Accepted Answer · answered Apr 10 '12 at 07:28

I found a solution to my own problem. To extract all images on all pages, it is not necessary to implement the different filters yourself. iTextSharp has an image renderer that saves every image in its original image type.

Just follow the example found here: http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx (you don't need to implement the HttpHandler).
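
For readers who cannot reach the link, the general shape of the render-listener approach is sketched below. It is not the code from the linked page. ImageCollector and Extractor are hypothetical names, and the sketch assumes an iTextSharp 5.x release whose PdfImageObject exposes GetImageAsBytes() and GetFileType().

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical listener: collects every image the content parser reports.
public class ImageCollector : IRenderListener
{
    // Each entry pairs a file extension (for example "png" or "jpg") with the image bytes.
    public readonly List<KeyValuePair<string, byte[]>> Images =
        new List<KeyValuePair<string, byte[]>>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        PdfImageObject image = renderInfo.GetImage();
        if (image == null) return;
        Images.Add(new KeyValuePair<string, byte[]>(
            image.GetFileType(), image.GetImageAsBytes()));
    }
}

public static class Extractor
{
    // Walks all pages and writes each collected image to outputDir.
    public static void ExtractAll(string pdfPath, string outputDir)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            PdfReaderContentParser parser = new PdfReaderContentParser(reader);
            ImageCollector listener = new ImageCollector();
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                parser.ProcessContent(page, listener);
            }
            for (int i = 0; i < listener.Images.Count; i++)
            {
                string name = "image" + i + "." + listener.Images[i].Key;
                File.WriteAllBytes(Path.Combine(outputDir, name), listener.Images[i].Value);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Depending on the library version, GetImage() may throw for filters it cannot decode, so wrapping the call in a try/catch could be worthwhile.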

der_chirurg

The site is not responding for me, but a snapshot is available via the Wayback Machine: https://web.archive.org/web/20160714220626/http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx – Cal Jacobson Aug 15 '17 at 15:55

score 1 · Answer 3 · answered Apr 05 '12 at 14:07

PDF supports a fairly wide variety of image formats, so I would not take the approach you've chosen here. You need to determine the image format from the bytes in the stream itself. For example, a JPEG file starts with the bytes FF D8, and a JFIF-format JPEG also carries the ASCII marker JFIF at byte offset 6.
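
As a rough illustration of checking the leading bytes, the sketch below tests a few well-known file signatures. ImageSniffer and GuessFormat are hypothetical names. Note that a decoded FlateDecode image stream normally contains raw pixel samples with no file header, so for such streams this check would be expected to return "unknown".

```csharp
using System;

// Hypothetical helper: guesses a container format from its first bytes.
public static class ImageSniffer
{
    public static string GuessFormat(byte[] data)
    {
        // JPEG: start-of-image marker FF D8 followed by another marker byte FF.
        if (StartsWith(data, 0xFF, 0xD8, 0xFF)) return "jpeg";
        // PNG: fixed 8-byte signature.
        if (StartsWith(data, 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)) return "png";
        // GIF: ASCII "GIF8" (covers GIF87a and GIF89a).
        if (StartsWith(data, 0x47, 0x49, 0x46, 0x38)) return "gif";
        // TIFF: "II*\0" (little-endian) or "MM\0*" (big-endian).
        if (StartsWith(data, 0x49, 0x49, 0x2A, 0x00)) return "tiff";
        if (StartsWith(data, 0x4D, 0x4D, 0x00, 0x2A)) return "tiff";
        return "unknown";
    }

    // True when data begins with exactly the given prefix bytes.
    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data == null || data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A stream whose dictionary uses /DCTDecode would typically pass the JPEG check, because its bytes are a complete JPEG file.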

.NET (3.0+) does come with a method that will attempt to pick the right decoder: BitmapDecoder.Create. See http://msdn.microsoft.com/en-us/library/system.windows.media.imaging.bitmapdecoder.aspx

If that doesn't work you may want to consider some third-party imaging libraries. I've used ImageMagick.NET and LeadTools (way overpriced).
