I'm trying to extract images from a PDF file using iTextSharp.

An example PDF I'm using is here.

The code I'm using is:

    static void Main(string[] args)
    {

        try
        {
            WriteImageFile(); // write image file
            System.Console.WriteLine(AppDomain.CurrentDomain.BaseDirectory);
            System.Console.ReadLine();
        }
        catch (Exception ex)
        {
            System.Console.WriteLine(ex.Message);
        }
    }

    private static List<System.Drawing.Image> ExtractImages(String PDFSourcePath)
    {
        List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();

        iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
        iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
        iTextSharp.text.pdf.PdfObject PDFObj = null;
        iTextSharp.text.pdf.PdfStream PDFStremObj = null;

        try
        {
            RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(PDFSourcePath);
            PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);
            if (PDFReaderObj.IsOpenedWithFullPermissions)
            {
                Debug.Print("this is a test");
            }

            for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
            {
                PDFObj = PDFReaderObj.GetPdfObject(i);

                if ((PDFObj != null) && PDFObj.IsStream())
                {
                    PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
                    iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);

                    if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
                    {
                        byte[] bytes = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw((iTextSharp.text.pdf.PRStream)PDFStremObj);

                        if ((bytes != null))
                        {
                            try
                            {
                                System.IO.MemoryStream MS = new System.IO.MemoryStream(bytes);

                                MS.Position = 0;
                                System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS);

                                ImgList.Add(ImgPDF);

                            }
                            catch (Exception e)
                            {
                                Console.WriteLine("Exception in extract: " + e);
                            }
                        }
                    }
                }
            }
            PDFReaderObj.Close();
        }
        catch (Exception ex)
        {
            throw new Exception(ex.Message);
        }
        return ImgList;
    }


    private static void WriteImageFile()
    {
        try
        {
            System.Console.WriteLine("Wait for extracting image from PDF file....");

            // Get a List of Image
            List<System.Drawing.Image> ListImage = ExtractImages(@"C:\Users\pradyut.bhattacharya\Documents\CEVA PDF\more\CS_75.pdf");

            for (int i = 0; i < ListImage.Count; i++)
            {
                try
                {
                    // Write Image File
                    ListImage[i].Save(@"C:\Users\pradyut.bhattacharya\Documents\CEVA PDF\more\Image" + i + ".jpeg", System.Drawing.Imaging.ImageFormat.Jpeg);
                    System.Console.WriteLine("Image" + i + ".jpeg write sucessfully");
                }
                catch (Exception)
                { }
            }

        }
        catch (Exception ex)
        {
            throw new Exception(ex.Message);
        }
    }

In some cases I can get the images, but for most PDFs that contain scanned pages I get this error:

    A first chance exception of type 'System.ArgumentException' occurred in System.Drawing.dll
    Exception in extract: System.ArgumentException: Parameter is not valid.
       at System.Drawing.Image.FromStream(Stream stream, Boolean useEmbeddedColorManagement, Boolean validateImageData)
       at System.Drawing.Image.FromStream(Stream stream)
       at ConsoleApplication1.Program.ExtractImages(String PDFSourcePath) in C:\Users\pradyut.bhattacharya\Documents\Visual Studio 2010\Projects\ConsoleApplication2\ConsoleApplication2\Program.cs:line 67
    A first chance exception of type 'System.ArgumentException' occurred in System.Drawing.dll
    Exception in extract: System.ArgumentException: Parameter is not valid.
       at System.Drawing.Image.FromStream(Stream stream, Boolean useEmbeddedColorManagement, Boolean validateImageData)
       at System.Drawing.Image.FromStream(Stream stream)
       at ConsoleApplication1.Program.ExtractImages(String PDFSourcePath) in C:\Users\pradyut.bhattacharya\Documents\Visual Studio 2010\Projects\ConsoleApplication2\ConsoleApplication2\Program.cs:line 67

Any help would be appreciated.

Thanks

Pradyut Bhattacharya

3 Answers

Old question, I know, but I've actually found a somewhat decent solution for this. I too was having difficulty with extracting images from PDFs that had JBig2 encodings. The newer versions (post 4.1.6) of iTextSharp actually support it, but those versions are now under the AGPL license.

Using version 1 of this library by JPedal (version 2 is not free), you can convert JBig2 encoded images to a System.Drawing.Bitmap and save or modify it however you want. However, this library only decodes the data; it cannot encode an image into the JBig2 format.

One minor caveat is that the library is written in Java. That is not an issue for a C# user, though, thanks to IKVM. IKVM, if you didn't already know about it, has a full Java VM that runs in .NET and has native .NET implementations of the Java class libraries. It's very easy to set up, and I literally just tested this all myself about 2 hours ago.

After you've downloaded IKVM and the JBig2 jar from the above link, you can execute this command to have IKVM convert the jar into a native .NET dll.

ikvmc -target:library [path to jbig2.jar]

That will output a .NET dll, named jbig2.dll, into either the directory of the jar or the directory of the ikvmc executable (I can't remember which). Then, reference jbig2.dll, IKVM.OpenJDK.Core, IKVM.OpenJDK.Media, IKVM.OpenJDK.SwingAWT and IKVM.Runtime in your project. I've used code similar to the following to extract the image:

// code to iterate over PDF objects and get bytes of a valid image elided
var imageBytes = GetRawImageBytesFromPdf();

if (filterType.Equals(PdfName.JBIG2DECODE))
{
    var jbg2 = new JBIG2Decoder();

    // Some JBig2 will extract without setting the JBig2Globals
    var decodeParams = stream.GetAsDict(PdfName.DECODEPARMS);
    if(decodeParams != null)
    {
        var globalRef = decodeParams.GetAsIndirectObject(
                                        PdfName.JBIG2GLOBALS);
        if(globalRef != null)
        {
            var globals = PdfReader.GetPdfObject(globalRef);
            var globalStream = globals as PRStream;

            // The "as" cast yields null if the globals object is not a stream
            if (globalStream != null)
            {
                var globalBytes = PdfReader.GetStreamBytesRaw(globalStream);

                if (globalBytes != null)
                {
                    jbg2.setGlobalData(globalBytes);
                }
            }
        }
    }

    jbg2.decodeJBIG2(imageBytes);

    var pages = jbg2.getNumberOfPages();

    for(int p = 0; p < pages; p++)
    {
        java.awt.image.BufferedImage bufImg = jbg2.getPageAsBufferedImage(p);

        var bitmap = bufImg.getBitmap();
        // include the page index so each page gets its own file
        bitmap.Save(@"c:\path\to\file" + p + ".tif", ImageFormat.Tiff);
        // note: I am unsure about the need to free the memory of the internal
        //       bitmap used in the BufferedImage class.  The docs for IKVM and
        //       that class should probably be consulted to find out if that
        //       should be done.
    }
}
// handle other formats like CCITTFAXDECODE

It does the job well, though the library isn't the fastest. This is unrelated to running it under IKVM; the developers admit that version 1 of the library is inefficient. I'm not in love with writing or editing Java code, so if I wanted to improve the speed myself, I'd probably just port it straight to C#. However, there is another fork of this Java code at this GitHub project that claims a 2.5-4.5x increase in speed. You could probably compile that jar and use ikvmc with that.

Hope this helps anyone still looking for a solution to this problem!

Christopher Currens
  • whenever I try to call "jbgI.getPageAsBufferedImage" I get an exception "The type initializer for 'java.awt.image.DataBuffer' threw an exception. ---> The type initializer for 'sun.awt.image.SunWritableRaster' threw an exception. ---> The type initializer for 'java.awt.image.WritableRaster' threw an exception. ---> The type initializer for 'java.awt.image.Raster' threw an exception. ---> The type initializer for 'java.awt.image.ColorModel' threw an exception. ---> java.lang.UnsatisfiedLinkError: no awt in java.library.path", maybe these libraries do not work anymore? – leandro koiti Aug 13 '14 at 01:06
  • What is bytes? On this line jbg2.decodeJBIG2(bytes); ? – Ray May 12 '15 at 19:31
  • @Ray Those are the bytes of the image from the PDF. I think it would be something like this: `byte[] bytes = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw((iTextSharp.text.pdf.PRStream)PDFStremObj);` (copied from user's question). – Christopher Currens May 12 '15 at 21:00

Images within a PDF can be stored in a variety of ways. Your code will work for every encoding that the .NET Framework has a decoder for, but will fail for the ones it doesn't. Specifically, your code is failing because that PDF has images encoded with the JBIG2Decode filter. You can check this by looking at the /Filter entry of PDFStremObj.

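PdfObject filterType = PDFStremObj.Get(PdfName.FILTER);
// /Filter is optional (and can also be an array of names), so guard against null
if(filterType != null && filterType.Equals(PdfName.JBIG2DECODE)){
    //...
}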
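
As a rough guide to what the filter check above can lead to, here is a minimal sketch that maps the standard PDF image filter names to what the raw stream bytes contain. The class and method names are hypothetical and not part of iTextSharp; it assumes the filter name is passed as a string, with or without the leading slash (for example the result of `filterType.ToString()`).

```csharp
using System;

// Hypothetical helper, not part of iTextSharp: describes what the raw bytes
// of an image stream contain for each standard PDF /Filter name.
public static class PdfImageFilterGuide
{
    // Only DCTDecode streams are complete JPEG files, which is the one case
    // where System.Drawing.Image.FromStream can usually read the raw bytes.
    public static bool IsLoadableByGdiPlus(string filterName)
    {
        return Normalize(filterName) == "DCTDecode";
    }

    public static string Describe(string filterName)
    {
        switch (Normalize(filterName))
        {
            case "DCTDecode":
                return "JPEG data; the raw bytes are a complete JPEG file";
            case "JPXDecode":
                return "JPEG 2000 data; needs a JPEG 2000 decoder";
            case "JBIG2Decode":
                return "embedded JBIG2 data; needs a JBIG2 decoder";
            case "CCITTFaxDecode":
                return "Group 3/4 fax data without a TIFF header";
            case "FlateDecode":
            case "LZWDecode":
            case "RunLengthDecode":
                return "compressed raw samples without any image file header";
            case "":
                return "no filter; uncompressed raw samples";
            default:
                return "unrecognised filter";
        }
    }

    // Accepts both "/DCTDecode" and "DCTDecode"; null is treated as "no filter".
    private static string Normalize(string filterName)
    {
        return (filterName ?? string.Empty).TrimStart('/');
    }
}
```

In the question's loop this could be used to skip, or at least log, streams where `IsLoadableByGdiPlus` returns false instead of passing them to `Image.FromStream`.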

Unfortunately, for types that the framework doesn't know about, you'll need either a library or your own decoder.
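
Independently of the filter name, the "Parameter is not valid" exception can be avoided by checking whether the bytes start with the signature of a file format that GDI+ reads before calling `Image.FromStream`. The following is only a heuristic sketch with hypothetical names: raw sample data could start with one of these byte sequences by coincidence, and a matching signature does not guarantee that the rest of the data decodes.

```csharp
using System;

// Hypothetical helper: a cheap pre-check before System.Drawing.Image.FromStream.
public static class ImageSignature
{
    // True when the buffer starts with the signature of a common file format
    // that GDI+ can decode (JPEG, PNG, GIF, BMP or TIFF).
    public static bool LooksLikeGdiPlusImage(byte[] data)
    {
        if (data == null) return false;

        return StartsWith(data, 0xFF, 0xD8, 0xFF)          // JPEG
            || StartsWith(data, 0x89, 0x50, 0x4E, 0x47)    // PNG
            || StartsWith(data, 0x47, 0x49, 0x46, 0x38)    // GIF ("GIF8")
            || StartsWith(data, 0x42, 0x4D)                // BMP ("BM")
            || StartsWith(data, 0x49, 0x49, 0x2A, 0x00)    // TIFF, little-endian
            || StartsWith(data, 0x4D, 0x4D, 0x00, 0x2A);   // TIFF, big-endian
    }

    private static bool StartsWith(byte[] data, params byte[] signature)
    {
        if (data.Length < signature.Length) return false;

        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }
}
```

JBIG2 and CCITT streams embedded in a PDF carry no such file header, so they fail this check and can be routed to a dedicated decoder instead.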

See this post for some other libraries that do it. Here's Wikipedia's entry on JBIG if you want to try to roll your own. And here's one more post that shows some encoders that might also support decoding which is what you need.

Chris Haas

Thanks for sharing this idea.

This solution was the most elegant I found using the free version of iTextSharp.

As you suggested I included the libraries:

jbig2dec.dll (generated from the command prompt with: ikvmc jbig2dec.jar)
ICSharpCode.SharpZipLib
IKVM.Runtime
IKVM.OpenJDK.Core
IKVM.OpenJDK.Media
IKVM.OpenJDK.SwingAWT
Taisbevalle