How to extract images from a PDF using iText 7 in C#

Question

I have used the approach below to extract images from a PDF, but the subtype is always null. I am working with the iText 7 library, which is the new version. If anybody has worked with the new library, please give suggestions.

    public static string ExtractImageFromPDF(string sourcePdf)
    {            
        PdfReader reader = new PdfReader(sourcePdf);
        try
        {
            PdfDocument document = new PdfDocument(reader);

            for (int pageNumber = 1; pageNumber <= document.GetNumberOfPages(); pageNumber++)
            {
                PdfDictionary obj = (PdfDictionary)document.GetPdfObject(pageNumber);

                if (obj != null && obj.IsStream())
                {
                    PdfDictionary pd = (PdfDictionary)obj;
                    if (pd.ContainsKey(PdfName.Subtype) && pd.Get(PdfName.Subtype).ToString() == "/Image")
                    {
                        string filter = pd.Get(PdfName.Filter).ToString();
                        string width = pd.Get(PdfName.Width).ToString();
                        string height = pd.Get(PdfName.Height).ToString();
                        string bpp = pd.Get(PdfName.BitsPerComponent).ToString();
                        string extent = ".";
                        byte[] img = null;
                        switch (filter)
                        {
                            case "/FlateDecode":
                                byte[] arr = FlateDecodeFilter.FlateDecode(null, true);
                                Bitmap bmp = new Bitmap(Int32.Parse(width), Int32.Parse(height), PixelFormat.Format24bppRgb);
                                BitmapData bmd = bmp.LockBits(new Rectangle(0, 0, Int32.Parse(width), Int32.Parse(height)), ImageLockMode.WriteOnly,
                                    PixelFormat.Format24bppRgb);
                                Marshal.Copy(arr, 0, bmd.Scan0, arr.Length);
                                bmp.UnlockBits(bmd);
                                bmp.Save("d:\\pdf\\bmp1.png", ImageFormat.Png);
                                break;
                            case "/CCITTFaxDecode":
                                break;
                            default:
                                break;
                        }
                    }
                }
            }
        }
        catch
        {
            throw;
        }
        return "";
    }

"it is returning null" nothing in the code you've posted returns null. — Ian Kemp, Oct 17 '19 at 11:41
The correct call is document.GetPdfObject(objectNumber), not document.GetPdfObject(pageNumber). — Tomex Ou, Mar 23 '22 at 17:23

score 0 · Answer 1 · answered Oct 17 '19 at 12:12

When you use QuickWatch on the pd value, what do you see in there? The iText 7 documentation states that it is a dictionary, so perhaps you can check which keys are available and find the entry that you're looking for.

PdfDictionary pd = (PdfDictionary)obj;

Documentation can be found here: https://api.itextpdf.com/iText7/dotnet/7.1.8/classi_text_1_1_kernel_1_1_pdf_1_1_pdf_dictionary.html
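
If no debugger is at hand, the same inspection can be done in code. The following is only a minimal sketch: the class and method names (PdfDictionaryDump, PrintEntries) are made up for illustration, and it assumes the iText 7 .NET PdfDictionary exposes KeySet() and Get() as described in the documentation linked above.

```csharp
using System;
using iText.Kernel.Pdf;

// Hypothetical helper, not part of iText: prints every key/value pair of a dictionary.
public static class PdfDictionaryDump
{
    public static void PrintEntries(PdfDictionary pd)
    {
        // KeySet() lists the names present in the dictionary, e.g. /Type, /Subtype, /Filter.
        foreach (PdfName key in pd.KeySet())
        {
            // Get() returns the value stored under that key.
            Console.WriteLine($"{key} = {pd.Get(key)}");
        }
    }
}
```

Calling PrintEntries(pd) right after the cast shows whether the object has a /Subtype entry at all.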

score 0 · Answer 2 · answered Oct 17 '19 at 14:08

Your approach is to check each indirect object in the document to see whether it is an image XObject and, if so, to extract the image data it contains.

In practice, though, you only iterate over the values 1..document.GetNumberOfPages() as object numbers, i.e. over only a fraction of the indirect objects in your document!

A PDF contains more indirect objects than pages, usually very many more.

Thus, iterate up to document.GetNumberOfPdfObjects()-1 instead.
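
A loop over all indirect objects could look roughly like the sketch below. It is an illustration under assumptions, not a definitive implementation: the class name, the method name, the outputDir parameter and the output file naming are invented here, and PdfImageXObject is used to obtain the image bytes instead of the manual FlateDecode and Bitmap handling in the question. GetImageBytes() may throw for images whose filter or color space iText cannot export, so real code would probably need error handling around it.

```csharp
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Xobject;

// Hypothetical helper for illustration; names and file layout are assumptions.
public static class PdfImageExtractor
{
    // Writes every image XObject found among the indirect objects to outputDir
    // and returns the number of images written.
    public static int ExtractImages(string sourcePdf, string outputDir)
    {
        int count = 0;
        Directory.CreateDirectory(outputDir);

        // Disposing the PdfDocument also closes the underlying reader.
        using (PdfDocument document = new PdfDocument(new PdfReader(sourcePdf)))
        {
            int total = document.GetNumberOfPdfObjects();

            // Object numbers, not page numbers: 1 .. GetNumberOfPdfObjects() - 1.
            for (int objNum = 1; objNum < total; objNum++)
            {
                PdfObject obj = document.GetPdfObject(objNum);
                if (obj == null || !obj.IsStream())
                {
                    continue; // free entries and non-stream objects cannot be images
                }

                PdfStream stream = (PdfStream)obj;
                if (!PdfName.Image.Equals(stream.GetAsName(PdfName.Subtype)))
                {
                    continue; // a stream, but not an image XObject
                }

                PdfImageXObject image = new PdfImageXObject(stream);
                byte[] bytes = image.GetImageBytes();
                string extension = image.IdentifyImageFileExtension();

                string target = Path.Combine(outputDir, $"image{objNum}.{extension}");
                File.WriteAllBytes(target, bytes);
                count++;
            }
        }
        return count;
    }
}
```

A call such as PdfImageExtractor.ExtractImages("input.pdf", "extracted") (both paths are placeholders) would then write one file per image stream. Comparing PdfName objects with Equals avoids the string comparison against "/Image" used in the question.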
