
When decoding an image within a PDF as FlateDecode via iTextSharp the image is distorted and I can't seem to figure out why.

The recognized bpp is Format1bppIndexed. If I modify the PixelFormat to Format4bppIndexed the image is recognizable to some degree (shrunk, coloring is off but readable) and is duplicated 4 times in a horizontal manner. If I adjust the pixel format to Format8bppIndexed it is also recognizable to some degree and is duplicated 8 times in a horizontal manner.

The image below is after a Format1bppIndexed pixel format approach. Unfortunately I am unable to show the others due to security constraints.

[image: distorted image]

The code below is essentially the only solution I have come across; it is repeated in various places on both SO and the web.

int xrefIdx = ((PRIndirectReference)obj).Number;
PdfObject pdfObj = doc.GetPdfObject(xrefIdx);
PdfStream str = (PdfStream)(pdfObj);
byte[] bytes = PdfReader.GetStreamBytesRaw((PRStream)str);

string filter = ((PdfArray)tg.Get(PdfName.FILTER))[0].ToString();
string width = tg.Get(PdfName.WIDTH).ToString();
string height = tg.Get(PdfName.HEIGHT).ToString();
string bpp = tg.Get(PdfName.BITSPERCOMPONENT).ToString();

if (filter == "/FlateDecode")
{
   bytes = PdfReader.FlateDecode(bytes, true);

   System.Drawing.Imaging.PixelFormat pixelFormat;
   switch (int.Parse(bpp))
   {
      case 1:
         pixelFormat = System.Drawing.Imaging.PixelFormat.Format1bppIndexed;
         break;
      case 8:
         pixelFormat = System.Drawing.Imaging.PixelFormat.Format8bppIndexed;
         break;
      case 24:
         pixelFormat = System.Drawing.Imaging.PixelFormat.Format24bppRgb;
         break;
      default:
         throw new Exception("Unknown pixel format " + bpp);
   }

   var bmp = new System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat);
   System.Drawing.Imaging.BitmapData bmd = bmp.LockBits(new System.Drawing.Rectangle(0, 0, Int32.Parse(width),
             Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat);
   Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length);
   bmp.UnlockBits(bmd);
   bmp.Save(@"C:\temp\my_flate_picture-" + DateTime.Now.Ticks.ToString() + ".png", ImageFormat.Png);
}

What do I need to do so that my image extraction works as desired when dealing with FlateDecode?

NOTE: I do not want to use another library to extract the images. I am looking for a solution leveraging ONLY iTextSharp and the .NET FW. If a solution exists via Java (iText) and is easily portable to .NET FW bits that would suffice as well.

UPDATE: The ImageMask property is set to true, which would imply that there is no color space and is therefore implicitly black and white. With the bpp coming in at 1, the PixelFormat should be Format1bppIndexed which as mentioned earlier, produces the embedded image seen above.

UPDATE: To get the image size I extracted it out using Acrobat X Pro and the image size for this particular example was listed as 2403x3005. When extracting via iTextSharp the size was listed as 2544x3300. I modified the image size within the debugger to mirror 2403x3005 however upon calling Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length); I get an exception raised.

Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

My assumption is that this is due to the modification of the size and thus no longer corresponding to the byte data that is being used.

UPDATE: Per Jimmy's recommendation, I checked whether calling PdfReader.GetStreamBytes returns a byte[] length equal to width*height/8, since GetStreamBytes should be calling FlateDecode. Manually calling FlateDecode and calling PdfReader.GetStreamBytes both produced a byte[] length of 1049401, while width*height/8 is 2544*3300/8 or 1049400, so there is a difference of 1. I am not sure whether this off-by-one is the root cause, and I am not sure how to resolve it if that is indeed the case.
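
For reference, the size check described above can be sketched as follows. This is a minimal sketch that assumes each row of PDF image samples starts on a byte boundary, so the per-row byte count is rounded up before multiplying by the height. The class and method names (`PdfImageSize`, `RowBytes`, `ExpectedLength`) are hypothetical helpers and are not part of iTextSharp.

```csharp
using System;

// Hypothetical helper, not an iTextSharp API.
static class PdfImageSize
{
    // Bytes in one row of decoded samples, rounded up to a whole byte.
    public static int RowBytes(int width, int bitsPerComponent, int components = 1)
    {
        return (width * bitsPerComponent * components + 7) / 8;
    }

    // Total number of sample bytes expected for the whole image.
    public static int ExpectedLength(int width, int height, int bitsPerComponent, int components = 1)
    {
        return RowBytes(width, bitsPerComponent, components) * height;
    }
}
```

With the values from this question, `PdfImageSize.ExpectedLength(2544, 3300, 1)` gives 318 bytes per row times 3300 rows, i.e. 1049400. Comparing that number with the decoded array length before copying, and copying no more than the expected count, avoids depending on any trailing byte in the decoded buffer.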

UPDATE: In trying the approach mentioned by kuujinbo I am met with an IndexOutOfRangeException when I attempt to call renderInfo.GetImage(); within the RenderImage listener. The fact that width*height/8, as stated earlier, is off by 1 in comparison to the byte[] length when calling FlateDecode makes me think these are all one and the same; however a solution still eludes me.

   at System.util.zlib.Adler32.adler32(Int64 adler, Byte[] buf, Int32 index, Int32 len)
   at System.util.zlib.ZStream.read_buf(Byte[] buf, Int32 start, Int32 size)
   at System.util.zlib.Deflate.fill_window()
   at System.util.zlib.Deflate.deflate_slow(Int32 flush)
   at System.util.zlib.Deflate.deflate(ZStream strm, Int32 flush)
   at System.util.zlib.ZStream.deflate(Int32 flush)
   at System.util.zlib.ZDeflaterOutputStream.Write(Byte[] b, Int32 off, Int32 len)
   at iTextSharp.text.pdf.codec.PngWriter.WriteData(Byte[] data, Int32 stride)
   at iTextSharp.text.pdf.parser.PdfImageObject.DecodeImageBytes()
   at iTextSharp.text.pdf.parser.PdfImageObject..ctor(PdfDictionary dictionary, Byte[] samples)
   at iTextSharp.text.pdf.parser.PdfImageObject..ctor(PRStream stream)
   at iTextSharp.text.pdf.parser.ImageRenderInfo.PrepareImageObject()
   at iTextSharp.text.pdf.parser.ImageRenderInfo.GetImage()
   at cyos.infrastructure.Core.MyImageRenderListener.RenderImage(ImageRenderInfo renderInfo)

UPDATE: Trying the various methods listed here (my original solution as well as the solution posed by kuujinbo) with a different page in the PDF produces imagery; however the issues always surface when the filter type is /FlateDecode, and no image is produced in that case.

Aaron McIver
  • How is the image distorted? Can you post a screenshot? It sounds like you've got the stride wrong somewhere or are multiplying things up incorrectly. – ChrisF Dec 13 '11 at 17:35
  • Is this related to this question? http://stackoverflow.com/questions/757265/how-does-pdfs-bitspercomponent-translate-to-bits-per-pixel-for-images If not I'll try to dig in a little deeper when I get a chance – Chris Haas Dec 13 '11 at 19:32
  • It does look like you're getting the stride wrong. Hmm, when I used to do image reading on a regular basis this sort of thing would happen all the time. You need to check that the `width`, `height` and `bpp` values are what you expect them to be. Try changing them in the debugger until you get the right result and then work backwards to what you're reading out of the file. – ChrisF Dec 13 '11 at 19:54
  • @ChrisHaas I'm not sure that it is. It looks like it could be however everything I know up to this point says that the `PixelFormat` specified is accurate yet the result is not. – Aaron McIver Dec 13 '11 at 20:00
  • @ChrisF Attempted to do what you suggested and was met with an exception; update posted in question. – Aaron McIver Dec 13 '11 at 20:27
  • I know you have security concerns but can you post a sample PDF of a less-sensitive document? – Chris Haas Dec 13 '11 at 20:43
  • Have you considered the /ColorSpace anywhere? It is possible that you may need to group together between 1 and 4 samples for each pixel colour value. – Jimmy Dec 13 '11 at 21:16
  • @Jimmy The ImageMask is marked as true; ColorSpace is therefore null and does not exist, implying black/white. – Aaron McIver Dec 13 '11 at 22:02
  • @ChrisHaas Are you looking for a screenshot of the original PDF document or with the varying PixelFormat's used? – Aaron McIver Dec 13 '11 at 22:04
  • If that is the case then bitsperpixel shall be 1 (8.9.6.2 Stencil Masking). Also, are you sure that you need to check that the filter is /FlateDecode? I generally just do PdfReader.GetStreamBytes() and let iText do the correct decoding. – Jimmy Dec 13 '11 at 22:25
  • @Jimmy The bpp is 1 and is set as such. The filter is coming in as /FlateDecode. How do you write the image to disk if you don't know the type? What encoder do you use? – Aaron McIver Dec 13 '11 at 22:36
  • I'm assuming GetStreamBytes will call FlateDecode on your behalf. Have you checked that bytes.Length equals (height*width/8)? I would believe the width and height in the image dictionary (rather than the value that Adobe Acrobat is giving). I have done something similar recently, but because I've been using Java I've not had the memory access that you have in C# (I've had to plot pixels individually). Hope you get it sorted. – Jimmy Dec 13 '11 at 23:28
  • @Jimmy Manually calling FlateDecode produces a byte[] length of 1049401, while the width*height/8 is 2544*3300/8 or 1049400, so there is a difference of 1. Not sure if this would be the root cause or not, an off by one; however I am not sure how to resolve if that is indeed the case. – Aaron McIver Dec 14 '11 at 15:27
  • It might cause an error if you copy one byte too many using "Marshal.Copy" and pass in the array length (you might be overwriting an important single byte of image metadata). – Jimmy Dec 14 '11 at 15:54
  • @Jimmy Just confirmed that calling GetStreamBytes which should call FlateDecode as you mentioned is indeed returning the same byte[] size as manually calling FlateDecode, 1049401. I will update the question accordingly; still at a loss on how to solve. – Aaron McIver Dec 14 '11 at 16:44
  • @Aaron have you tried Marshal.Copy(bytes, 0, bmd.Scan0, length - 1); or similar? I think it would be good to rule that out. The pixel format should be identical between the PDF and BitmapData; otherwise you might want to try SetPixel inside a loop (instead of 'unsafe' memory copies ;-) to fill in the bitmap. Beyond that, I haven't a clue either. – Jimmy Dec 14 '11 at 16:56
  • @Jimmy In trying to copy the data via length - 1 the outcome was identical. Would a screenshot of 4bppIndexed help any? The image becomes legible (smaller than it should be), colors appear inverted (background is black when it should be white), and the _image_ is multiplied 4 times over. – Aaron McIver Dec 14 '11 at 20:10

2 Answers

Try copying your data row by row; that may solve the problem.

// Requires: using System; using System.Drawing; using System.Drawing.Imaging;
//           using System.Runtime.InteropServices; using iTextSharp.text.pdf;
int w = imgObj.GetAsNumber(PdfName.WIDTH).IntValue;
int h = imgObj.GetAsNumber(PdfName.HEIGHT).IntValue;
int bpp = imgObj.GetAsNumber(PdfName.BITSPERCOMPONENT).IntValue;
var pixelFormat = PixelFormat.Format1bppIndexed; // matches bpp == 1

byte[] rawBytes = PdfReader.GetStreamBytesRaw((PRStream)imgObj);
byte[] decodedBytes = PdfReader.FlateDecode(rawBytes);
byte[] streamBytes = PdfReader.DecodePredictor(decodedBytes, imgObj.GetAsDict(PdfName.DECODEPARMS));
// byte[] streamBytes = PdfReader.GetStreamBytes((PRStream)imgObj); // same result as above 3 lines of code.

using (Bitmap bmp = new Bitmap(w, h, pixelFormat))
{
    var bmpData = bmp.LockBits(new Rectangle(0, 0, w, h), ImageLockMode.WriteOnly, pixelFormat);
    // Bytes per row in the decoded PDF data; bmpData.Stride may be larger.
    int length = (int)Math.Ceiling(w * bpp / 8.0);
    for (int i = 0; i < h; i++)
    {
        int offset = i * length;             // start of row i in the PDF data
        int scanOffset = i * bmpData.Stride; // start of row i in the bitmap
        // IntPtr.Add (.NET 4.0+) avoids the overflow that Scan0.ToInt32() causes in 64-bit processes.
        Marshal.Copy(streamBytes, offset, IntPtr.Add(bmpData.Scan0, scanOffset), length);
    }
    bmp.UnlockBits(bmpData);

    bmp.Save(fileName);
}
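
Why the row-by-row copy matters can be seen from the arithmetic alone. The following is a minimal sketch, assuming a top-down bitmap whose stride is the row size rounded up to a multiple of 4 bytes, while the decoded PDF rows are only rounded up to a whole byte. `StrideDemo` and its methods are hypothetical names used for illustration; in real code the stride should be read from `bmpData.Stride` rather than computed.

```csharp
using System;

// Hypothetical helper for illustration only.
static class StrideDemo
{
    // Bytes per row in the decoded PDF data (rows are byte-aligned).
    public static int PdfRowBytes(int width, int bpp)
    {
        return (width * bpp + 7) / 8;
    }

    // Bytes per row assuming rows padded to a 4-byte boundary.
    public static int PaddedStride(int width, int bpp)
    {
        return ((width * bpp + 31) / 32) * 4;
    }

    // Copy tightly packed rows into a buffer whose rows are padded.
    public static byte[] PadRows(byte[] src, int width, int height, int bpp)
    {
        int rowBytes = PdfRowBytes(width, bpp);
        int stride = PaddedStride(width, bpp);
        byte[] dst = new byte[stride * height];
        for (int y = 0; y < height; y++)
        {
            Buffer.BlockCopy(src, y * rowBytes, dst, y * stride, rowBytes);
        }
        return dst;
    }
}
```

For a 2544-pixel-wide 1 bpp image, `PdfRowBytes` gives 318 while `PaddedStride` gives 320, so a single bulk copy drifts by 2 bytes (16 pixels) on every row, which is consistent with the skewed output described in the question.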
bigsan
  • sounds like a very good idea to try, seems like the stride is rounded up to 4 byte boundary http://msdn.microsoft.com/en-us/library/system.drawing.imaging.bitmapdata.stride.aspx – Jimmy Dec 15 '11 at 16:59
  • @bigsan It worked! The colors are inverted, black is white and white is black but otherwise it worked perfectly. – Aaron McIver Dec 16 '11 at 18:31
  • @AaronMcIver the inversion problem is due to a bug in `DecodePredictor` method. To correct this, there is another implementation in pdfbox project, see [line 202~324](http://www.docjar.com/html/api/org/apache/pdfbox/filter/FlateFilter.java.html). – bigsan Dec 17 '11 at 06:03
  • as the image is a stencil mask you also need to consider the /Decode key to know how to interpret the colour. glad you got it sorted – Jimmy Dec 17 '11 at 09:15
  • Thanks bigsan, this helped me considerably. I encountered an overflow error, but it was resolved by changing bmpData.Scan0.ToInt32() to bmpData.Scan0.ToInt64(). – Cal Jacobson Aug 15 '17 at 16:42
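
If the saved 1 bpp image comes out with black and white swapped, as reported in the comments above, one possible workaround is to flip every sample bit before copying the rows into the bitmap. This is only a sketch under that assumption: it does not read the /Decode array and does not address the cause of the inversion. `MaskBits.Invert` is a hypothetical helper, not an iTextSharp method.

```csharp
using System;

// Hypothetical helper for illustration only.
static class MaskBits
{
    // Return a copy of the sample data with every bit flipped (0 <-> 1).
    public static byte[] Invert(byte[] samples)
    {
        byte[] result = new byte[samples.Length];
        for (int i = 0; i < samples.Length; i++)
        {
            result[i] = (byte)(samples[i] ^ 0xFF);
        }
        return result;
    }
}
```

Any padding bits at the end of a row are flipped as well, which should be harmless because they fall outside the image width. Whether flipping is correct for a given PDF depends on its /Decode entry, so it is worth checking that value first.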

If you're able to use the latest version (5.1.3), the API to extract FlateDecode and other image types has been simplified using the iTextSharp.text.pdf.parser namespace. Basically you use a PdfReaderContentParser to help you parse the PDF document, then you implement the IRenderListener interface specifically (in this case) to deal with images. Here's a working example HTTP handler:

<%@ WebHandler Language="C#" Class="bmpExtract" %>
using System;
using System.Collections.Generic;
using System.IO;
using System.Web;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class bmpExtract : IHttpHandler {
  public void ProcessRequest (HttpContext context) {
    HttpServerUtility Server = context.Server;
    HttpResponse Response = context.Response;
    PdfReader reader = new PdfReader(Server.MapPath("./bmp.pdf"));
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    MyImageRenderListener listener = new MyImageRenderListener();
    for (int i = 1; i <= reader.NumberOfPages; i++) {
      parser.ProcessContent(i, listener);
    } 
    for (int i = 0; i < listener.Images.Count; ++i) {
      string path = Server.MapPath("./" + listener.ImageNames[i]);
      using (FileStream fs = new FileStream(
        path, FileMode.Create, FileAccess.Write
      ))
      {
        fs.Write(listener.Images[i], 0, listener.Images[i].Length);
      }
    }         
  }
  public bool IsReusable { get { return false; } }

  public class MyImageRenderListener : IRenderListener {
    public void RenderText(TextRenderInfo renderInfo) { }
    public void BeginTextBlock() { }
    public void EndTextBlock() { }

    public List<byte[]> Images = new List<byte[]>();
    public List<string> ImageNames = new List<string>();
    public void RenderImage(ImageRenderInfo renderInfo) {
      PdfImageObject image = null;
      try {
        image = renderInfo.GetImage();
        if (image == null) return;

        ImageNames.Add(string.Format(
          "Image{0}.{1}", renderInfo.GetRef().Number, image.GetFileType()
        ));
        using (MemoryStream ms = new MemoryStream(image.GetImageAsBytes())) {
          Images.Add(ms.ToArray());
        }
      } 
      catch (IOException ie) {
/*
 * pass-through; image type not supported by iText[Sharp]; e.g. jbig2
*/
      }
    }
  }
}

The iText[Sharp] development team is still working on the implementation, so I can't say for sure if it will work in your case. But it does work on this simple example PDF (used above) and on a couple of other PDFs with bitmap images that I tried.

EDIT: I've been experimenting with the new API too and made a mistake in the original code example above. Should have initialized the PdfImageObject to null outside the try..catch block. Correction made above.

Also, when I use the above code on an unsupported image type (e.g. jbig2), I get a different Exception: "The color depth XX is not supported", where "XX" is a number. And iTextSharp does support FlateDecode in all the examples I've tried (but that's not helping you in this case, I know).

Is the PDF produced by third-party (non-Adobe) software? From what I've read in the book, some third-party vendors produce PDFs that aren't completely up to spec, and iText[Sharp] can't deal with some of these PDFs, while Adobe products can. IIRC I've seen cases on the iText mailing list where PDFs generated by Crystal Reports caused problems; here's one thread.

Is there any way you can generate a test PDF with the software you're using with some non-sensitive FlateDecode image(s)? Then maybe someone here could help a little better.

kuujinbo
  • I had tried this initially but recall running into an issue. I just gave it a spin again and when I call renderInfo.GetImage() an `IndexOutOfRangeException` surfaces. I have updated the question accordingly. The fact that the initial byte[] length was off by one when taking the width*height/8 of the image makes me a bit suspect that this may be the cause of the exception as well. – Aaron McIver Dec 14 '11 at 21:21
  • Updated code example and added some additional thoughts. See above. – kuujinbo Dec 15 '11 at 06:15
  • I attempted to use bigsan's solution first and it worked. As noted in the comment, the colors are inverted but otherwise it's perfect. – Aaron McIver Dec 16 '11 at 18:32
  • Thanks for the update. If you don't mind, I'd still like to know if the PDF was produced by a 3rd party vendor or Adobe. It's also strange that only _some_ of the `/FlateDecode` images throw an Exception. If you can create a test file with non-sensitive data and either post a similar question on the mailing list or on the [project page](http://sourceforge.net/projects/itextsharp/) as a possible bug (in regards to the new parsing APIs), I think it would help both the development team and the iTextSharp user community too. Your question was interesting :) – kuujinbo Dec 16 '11 at 20:39
  • It's a very mixed bag with regard to the PDF. Portions of the pages/overlays are created using xpression whereas other portions are created using a tool called Doc1. The resulting format in either case is AFP. There is now another tool taking the place of xpression which I do not know the name of that is used to build the overlays. This data is all merged within a PDF, which I believe is not actually generated via Adobe software; however I am not 100% certain of that. It's a mash up to say the least. The Exception was surfacing when I was adjusting the byte data. – Aaron McIver Dec 19 '11 at 16:31