Process images extracted with PdfPig

Question

Images extracted using PdfPig are of type XObjectImage or InlineImage (both implement IPdfImage). I would like to save them and display them in a simple WPF application. To do that, I need them in a more accessible form, for example as a BitmapImage. What is the correct way to achieve that? The library documentation does not help here, and my own attempts were unsuccessful.

score 2 · Answer 1 · edited Dec 28 '20 at 22:46

I haven't tested any of this, but it should at least put you on the right path if it doesn't work.

Looking at the PdfPig source on GitHub, I can see that both XObjectImage and InlineImage have a TryGetPng method. From the looks of it, I would assume that the byte array it outputs matches the contents of a normal PNG file, which means you should be able to load it straight into a BitmapImage.

Taking some code from this answer, something like this might work:

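// pdfImage is an IPdfImage (XObjectImage or InlineImage) obtained from PdfPig.
if (pdfImage.TryGetPng(out byte[] png))
{
    // ImageSourceConverter decodes the PNG bytes into a WPF image source.
    var bitmap = (BitmapSource)new ImageSourceConverter().ConvertFrom(png);
}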

Note: both classes also have a TryGetBytes method, which might work in place of TryGetPng. I'm just not sure what format the output of TryGetBytes is in, so I'd be more confident with TryGetPng. Still, I'd try both if one doesn't work.
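
Since the question asks for a BitmapImage specifically, here is a minimal sketch of decoding encoded image bytes (the PNG from TryGetPng, or JPEG data) through a MemoryStream. The class and method names (`PdfImageLoading`, `ToBitmapImage`) are made up for illustration and are not part of PdfPig or WPF. Setting `CacheOption` to `OnLoad` makes WPF decode during `EndInit`, so the stream can be disposed right away.

```csharp
using System.IO;
using System.Windows.Media.Imaging;

public static class PdfImageLoading
{
    // Hypothetical helper: decodes an encoded image file (PNG, JPEG, ...)
    // held in memory into a BitmapImage.
    public static BitmapImage ToBitmapImage(byte[] encodedBytes)
    {
        using (var stream = new MemoryStream(encodedBytes))
        {
            var image = new BitmapImage();
            image.BeginInit();
            // Decode immediately so the stream may be closed after EndInit.
            image.CacheOption = BitmapCacheOption.OnLoad;
            image.StreamSource = stream;
            image.EndInit();
            // A frozen image is read-only and can be used from any thread.
            image.Freeze();
            return image;
        }
    }
}
```

With this in place, the body of the `if` above could become `myImageControl.Source = PdfImageLoading.ToBitmapImage(png);`, where `myImageControl` stands for whatever `Image` element the window uses.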

edited Dec 28 '20 at 22:46 by Clemens
answered Dec 28 '20 at 22:44 by Keith Stein

Both `TryGetPng` and `TryGetBytes` return false for every PDF with images I have tried. However, `pdfImage.RawBytes` in combination with the solution you proposed works fine for some images (JPGs, I suppose). Often, though, I get `System.NotSupportedException: 'No imaging component suitable to complete this operation was found.'` – radoslawik Dec 30 '20 at 10:39
@radoslawik `TryGetPng` attempts to convert the bytes to a valid PNG image, and `TryGetBytes` removes any filters to give the raw (PDF format) bitmap data of the image. The exception is JPG images, where the raw bytes are the JPG data (this is why JPG works). I'm not sure why you're getting `false` for so many documents when using `TryGetPng`. Are you able to raise an issue in the repository with example files, please? It's likely due to either color spaces or the CCITT fax filter. – Underscore Apr 07 '21 at 16:26
@Underscore: Thank you for that information, but it's still unclear how to extract an arbitrary image, since there are three competing methods (`TryGetPng`, `TryGetBytes`, and `RawBytes`). It would be much more helpful if there was a single method that returned, say, a `System.Drawing.Image`. – Brian Berns Jun 02 '21 at 14:51
@brianberns I agree the current API is sub-optimal. The problem is I wanted to avoid any dependency on an image library, e.g. System.Drawing.Image or SixLabors.ImageSharp or BigGustave (my own) or SkiaSharp, since a dependency makes the library less useful for some groups of users. `TryGetBytes`: advanced; un-applies any PDF filters and returns the bytes as stored in the PDF, which must be interpreted using `ColorSpace`. `RawBytes`: bytes with all PDF filters still applied; only valid for JPGs. `TryGetPng`: takes `TryGetBytes` and adds extra logic to generate a PNG by interpreting the `ColorSpace`. – Underscore Jun 03 '21 at 12:43
Image support is still in progress, hence the availability of `TryGetBytes`. PDF supports some stupid number of different color spaces, which are very involved to decode and apply in order to retrieve a normal bitmap. A color space is basically a function that maps bytes -> RGB, but it gets a lot more complex very quickly. Exposing `TryGetBytes` unblocks consumers with very specific color space needs, who can implement the transform they require. Work to support all color spaces in `TryGetPng` is ongoing. Until all are supported, there are some images we just can't retrieve. – Underscore Jun 03 '21 at 12:46
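
Because `RawBytes` is described above as decodable only when it holds JPG data, one way to avoid the `NotSupportedException` is to look at the first few bytes before handing them to a decoder. The sketch below checks for the standard PNG and JPEG file signatures; `ImageSniffer` and `GuessFormat` are illustrative names, not PdfPig API.

```csharp
public static class ImageSniffer
{
    // PNG files start with the 8-byte signature 89 50 4E 47 0D 0A 1A 0A.
    private static readonly byte[] PngSignature =
        { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

    // JPEG streams start with the SOI marker FF D8, followed by the
    // first byte (FF) of the next marker.
    private static readonly byte[] JpegSignature = { 0xFF, 0xD8, 0xFF };

    // Returns "png", "jpeg", or null when neither signature matches.
    public static string GuessFormat(byte[] data)
    {
        if (data == null) return null;
        if (StartsWith(data, PngSignature)) return "png";
        if (StartsWith(data, JpegSignature)) return "jpeg";
        return null;
    }

    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A `null` result suggests the bytes are not a PNG or JPEG file (for example, CCITT-encoded or otherwise filtered data), so passing them to an image decoder is unlikely to work.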

score 0 · Answer 2 · answered Jun 02 '21 at 14:57

FWIW, by trial and error, my current approach is to start with TryGetPng and fall back to RawBytes if it fails. I then interpret the extracted bytes as a System.Drawing.Image. I don't use TryGetBytes at all. Here's my code (F#, but should be easy to convert to C#):

let bytes =
    match pdfImage.TryGetPng() with
        | true, bytes -> bytes
        | _ -> Seq.toArray pdfImage.RawBytes
use stream = new MemoryStream(bytes)
use image = Image.FromStream(stream)
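
For readers who want the C# equivalent, the following is a rough translation of the F# above, assuming the PdfPig API as it appears in this thread (`TryGetPng(out byte[])` and a `RawBytes` collection that can be copied with `ToArray()`). The `PdfImageFallback` class and `ToDrawingImage` method are hypothetical names. One deliberate difference: the decoded image is copied into a new `Bitmap`, because GDI+ expects the source stream to stay open for as long as an image created by `Image.FromStream` is in use.

```csharp
using System.Drawing;
using System.IO;
using System.Linq;
using UglyToad.PdfPig.Content;

public static class PdfImageFallback
{
    // Hypothetical helper. The caller owns the returned Image and
    // should dispose it when finished.
    public static Image ToDrawingImage(IPdfImage pdfImage)
    {
        // Prefer PdfPig's PNG conversion; otherwise fall back to the
        // still-filtered bytes, which decode only when they are JPEG data.
        byte[] bytes = pdfImage.TryGetPng(out var png)
            ? png
            : pdfImage.RawBytes.ToArray();

        using (var stream = new MemoryStream(bytes))
        using (var decoded = Image.FromStream(stream))
        {
            // Copy the pixels so the result does not depend on the stream.
            return new Bitmap(decoded);
        }
    }
}
```

`Image.FromStream` throws an `ArgumentException` when the bytes are not in a format GDI+ recognizes, so callers may want to catch that for images that are neither convertible to PNG nor stored as JPEG.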

score 0 · Answer 3 · answered Dec 28 '21 at 11:48

I find that the following code works for me in most cases. It simply tries all three options available for extracting an image (TryGetPng, TryGetBytes and RawBytes) and converts the result to a BitmapSource.

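    private static BitmapSource TryGetImage(IPdfImage image)
    {
        var converter = new ImageSourceConverter();
        BitmapSource bmp;

        // 1. Let PdfPig produce a PNG, which WPF can decode directly.
        if (image.TryGetPng(out byte[] pngBytes))
        {
            bmp = (BitmapSource)converter.ConvertFrom(pngBytes);
            Debug.WriteLine("Converted using TryGetPng.");
            return bmp;
        }

        // 2. Try the unfiltered bytes. These are often raw samples rather than
        //    an image file, so the converter may throw NotSupportedException.
        if (image.TryGetBytes(out IReadOnlyList<byte> decodedBytes))
        {
            bmp = (BitmapSource)converter.ConvertFrom(decodedBytes.ToArray());
            Debug.WriteLine("Converted using TryGetBytes.");
            return bmp;
        }

        // 3. Fall back to the still-filtered bytes (decodable for JPEG data).
        //    The stream must stay open while GDI+ uses the bitmap.
        using (var ms = new MemoryStream(image.RawBytes.ToArray()))
        using (var nbmp = new Bitmap(ms))
        {
            bmp = ConvertBmpToBmpSource(nbmp);
            Debug.WriteLine("Converted using RawBytes.");
            return bmp;
        }
    }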

    public static BitmapSource ConvertBmpToBmpSource(Bitmap bitmap)
    {
        // Lock as 24bpp so the buffer layout always matches PixelFormats.Bgr24
        // below, whatever pixel format the source bitmap uses.
        var bitmapData = bitmap.LockBits(
            new Rectangle(0, 0, bitmap.Width, bitmap.Height),
            System.Drawing.Imaging.ImageLockMode.ReadOnly,
            System.Drawing.Imaging.PixelFormat.Format24bppRgb);
        try
        {
            return BitmapSource.Create(
                bitmapData.Width, bitmapData.Height,
                bitmap.HorizontalResolution, bitmap.VerticalResolution,
                PixelFormats.Bgr24, null,
                bitmapData.Scan0, bitmapData.Stride * bitmapData.Height, bitmapData.Stride);
        }
        finally
        {
            // Unlock even if BitmapSource.Create throws.
            bitmap.UnlockBits(bitmapData);
        }
    }
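
According to the comments under the first answer, `TryGetBytes` yields unfiltered sample data that has to be interpreted using the image's color space, rather than an image file a decoder can read. As a sketch of what that interpretation looks like in the simplest case, the helper below wraps samples in a WPF `BitmapSource` under explicit assumptions: 8 bits per component, three components per pixel in R, G, B order, and rows stored top to bottom without padding. `RawSamples` and `FromRgb24` are illustrative names, and the width and height are assumed to come from the image's `WidthInSamples` and `HeightInSamples`. Indexed, CMYK, 1-bit and other layouts would need their own conversion first.

```csharp
using System;
using System.Windows.Media;
using System.Windows.Media.Imaging;

public static class RawSamples
{
    // Hypothetical helper: only valid for 8-bit RGB samples, three bytes
    // per pixel, rows top to bottom with no padding.
    public static BitmapSource FromRgb24(byte[] samples, int width, int height)
    {
        int stride = width * 3; // bytes per row
        if (samples.Length < stride * height)
        {
            throw new ArgumentException(
                "Not enough sample data for the given size.", nameof(samples));
        }

        var bitmap = BitmapSource.Create(
            width, height,
            96, 96,                    // the samples carry no DPI; 96 is assumed
            PixelFormats.Rgb24, null,  // no palette for a non-indexed format
            samples, stride);
        bitmap.Freeze();
        return bitmap;
    }
}
```

Whether a given image actually matches these assumptions has to be checked against its color space and bits per component before calling such a helper; otherwise the result will be a garbled picture rather than an exception.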
