I want to extract all images from a PDF. I found some code that looks like exactly what I need:

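' Assumed imports for the names used below (pdf, Marshal, MemoryStream), e.g.:
'   Imports System.IO
'   Imports System.Runtime.InteropServices
'   Imports pdf = iTextSharp.text.pdf
Private Sub getAllImages(ByVal dict As pdf.PdfDictionary, ByVal images As List(Of Byte()), ByVal doc As pdf.PdfReader)
Dim res As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(dict.Get(pdf.PdfName.RESOURCES)), pdf.PdfDictionary)
' A page or form without a resources dictionary has no images to collect
If res Is Nothing Then Return
Dim xobj As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(res.Get(pdf.PdfName.XOBJECT)), pdf.PdfDictionary)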

If xobj IsNot Nothing Then
    For Each name As pdf.PdfName In xobj.Keys
        Dim obj As pdf.PdfObject = xobj.Get(name)
        If (obj.IsIndirect) Then
            Dim tg As pdf.PdfDictionary = CType(pdf.PdfReader.GetPdfObject(obj), pdf.PdfDictionary)
            Dim subtype As pdf.PdfName = CType(pdf.PdfReader.GetPdfObject(tg.Get(pdf.PdfName.SUBTYPE)), pdf.PdfName)
            If pdf.PdfName.IMAGE.Equals(subtype) Then
                Dim xrefIdx As Integer = CType(obj, pdf.PRIndirectReference).Number
                Dim pdfObj As pdf.PdfObject = doc.GetPdfObject(xrefIdx)
                Dim str As pdf.PdfStream = CType(pdfObj, pdf.PdfStream)
                Dim bytes As Byte() = pdf.PdfReader.GetStreamBytesRaw(CType(str, pdf.PRStream))

                Dim filter As String = tg.Get(pdf.PdfName.FILTER).ToString
                Dim width As String = tg.Get(pdf.PdfName.WIDTH).ToString
                Dim height As String = tg.Get(pdf.PdfName.HEIGHT).ToString
                Dim bpp As String = tg.Get(pdf.PdfName.BITSPERCOMPONENT).ToString

                If filter = "/FlateDecode" Then
                    bytes = pdf.PdfReader.FlateDecode(bytes, True)
                    Dim pixelFormat As System.Drawing.Imaging.PixelFormat
                    Select Case Integer.Parse(bpp)
                        Case 1
                            pixelFormat = Drawing.Imaging.PixelFormat.Format1bppIndexed
                        Case 24
                            pixelFormat = Drawing.Imaging.PixelFormat.Format24bppRgb
                        Case Else
                            Throw New Exception("Unknown pixel format " + bpp)
                    End Select
                    Dim bmp As New System.Drawing.Bitmap(Int32.Parse(width), Int32.Parse(height), pixelFormat)
                    Dim bmd As System.Drawing.Imaging.BitmapData = bmp.LockBits(New System.Drawing.Rectangle(0, 0, Int32.Parse(width), Int32.Parse(height)), System.Drawing.Imaging.ImageLockMode.WriteOnly, pixelFormat)
                    Marshal.Copy(bytes, 0, bmd.Scan0, bytes.Length)
                    bmp.UnlockBits(bmd)
                    Using ms As New MemoryStream
                        bmp.Save(ms, System.Drawing.Imaging.ImageFormat.Png)
                        ' ToArray returns only the bytes written; GetBuffer also returns unused buffer space
                        bytes = ms.ToArray()
                    End Using
                End If
                images.Add(bytes)
            ElseIf pdf.PdfName.FORM.Equals(subtype) OrElse pdf.PdfName.GROUP.Equals(subtype) Then
                getAllImages(tg, images, doc)
            End If
        End If
    Next
End If
End Sub

My question is simply how to call this. I do not know what to pass for the dict parameter or the images list.

In essence, if I have a PDF located at C:\temp\test.pdf that contains images, how do I call this?

    Dim x As New FileStream("C:\temp\test.pdf", FileMode.Open)
    Dim reader As New iTextSharp.text.pdf.PdfReader(x)
    getAllImages(?????, ?????? ,reader)
skolima
  • When referencing code that you found elsewhere please include a link referencing the source. Often the source has the answers and we don't have to reinvent the wheel every time. http://stackoverflow.com/a/1220959/231316 – Chris Haas Feb 13 '12 at 14:01

1 Answer

The way this person wrote this method can seem weird if you don't understand the internals of PDFs and/or iTextSharp. The method takes three parameters. The first is a PdfDictionary, which you obtain by calling GetPageN(Integer) on the reader for each of your pages. The second is a generic list, which you need to initialize on your own before calling the method. The method is intended to be called in a loop, once for each page in a PDF, and each call appends that page's images to the list. The last parameter you understand already.

So here's the code to call this method:

''//Source file to read images from
Dim InputFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "FileWithImages.pdf")

''//List to dump images into
Dim Images As New List(Of Byte())

''//Main PDF reader
Dim Reader As New PdfReader(InputFile)

''//Total number of pages in the PDF
Dim PageCount = Reader.NumberOfPages

''//Loop through each page (first page is one, not zero)
For I = 1 To PageCount
    getAllImages(Reader.GetPageN(I), Images, Reader)
Next

''//Release the file once all pages have been processed
Reader.Close()
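
Once the loop has filled the list, you will probably want to write the byte arrays to disk. The following is only a minimal sketch: the module name, `GuessExtension`, `SaveImages` and the `image_N` file naming are made up for illustration and are not part of iTextSharp. It looks at the first few bytes to tell PNG and JPEG data apart and falls back to `.bin` for anything it does not recognise.

```vbnet
Imports System.Collections.Generic
Imports System.IO

Module ImageSaveSketch
    ' Hypothetical helper: guesses a file extension from the leading bytes.
    Function GuessExtension(data As Byte()) As String
        If data IsNot Nothing AndAlso data.Length >= 4 Then
            ' PNG files start with the bytes 89 50 4E 47
            If data(0) = &H89 AndAlso data(1) = &H50 AndAlso data(2) = &H4E AndAlso data(3) = &H47 Then
                Return ".png"
            End If
            ' JPEG files start with the bytes FF D8 FF
            If data(0) = &HFF AndAlso data(1) = &HD8 AndAlso data(2) = &HFF Then
                Return ".jpg"
            End If
        End If
        ' Unknown content: keep the raw bytes so nothing is lost
        Return ".bin"
    End Function

    ' Hypothetical helper: writes every entry of the list to its own file.
    Sub SaveImages(images As List(Of Byte()), outputFolder As String)
        Directory.CreateDirectory(outputFolder)
        For i As Integer = 0 To images.Count - 1
            Dim fileName As String = "image_" & (i + 1).ToString() & GuessExtension(images(i))
            File.WriteAllBytes(Path.Combine(outputFolder, fileName), images(i))
        Next
    End Sub
End Module
```

With the calling code above, this could be used as `SaveImages(Images, "C:\temp\images")`, where the output folder is just an example path.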

VERY, VERY IMPORTANT - iTextSharp is NOT a PDF renderer, it is a PDF composer. What this means is that it knows it has image-like objects but it doesn't necessarily know much about them. To say it another way, iTextSharp knows that a given byte array represents something that the PDF standard says is an image, but it doesn't know or care if it's a JPEG, TIFF, BMP or something else. All iTextSharp cares about is that this object has a few standard properties it can manipulate, like X, Y and effective width and height. PDF renderers handle the job of converting the bytes to an actual image. In this case, you are the PDF renderer, so it's your job to figure out how to process the byte array as an image.

Specifically, you'll see in that method that there's a line that reads:

If filter = "/FlateDecode" Then

This is often written as a Select Case or switch statement to process the various values of filter. The method you are referencing only handles FlateDecode, which is pretty common, although there are actually 10 standard filters, such as CCITTFaxDecode, JBIG2Decode and DCTDecode (PDF Spec 7.4 - Filters). You should modify the method to include a catch of some sort (an Else or Default case) so that you are at least aware of images you aren't set up to process.
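
As a rough illustration of that Select Case idea, here is a sketch that only classifies the filter name and always ends in a Case Else. The module and function names are hypothetical, and the returned strings are notes rather than working decoders. It also assumes the filter is a single name like the `/FlateDecode` string compared above; the PDF specification allows the Filter entry to be an array of several filters, which this sketch does not handle.

```vbnet
Module FilterSketch
    ' Hypothetical helper: describes how a stream with the given filter might be handled.
    Function DescribeFilter(filterName As String) As String
        Select Case filterName
            Case "/FlateDecode"
                Return "zlib-compressed pixel data: inflate it, then build a bitmap from the samples"
            Case "/DCTDecode"
                Return "JPEG data: the raw stream bytes can usually be saved as a .jpg file"
            Case "/JPXDecode"
                Return "JPEG 2000 data: the raw stream bytes can usually be saved as a .jp2 file"
            Case "/CCITTFaxDecode", "/JBIG2Decode"
                Return "bilevel fax or JBIG2 data: needs a dedicated decoder"
            Case Else
                ' The catch-all: at least report what was skipped
                Return "unhandled filter: " & filterName
        End Select
    End Function
End Module
```

In the original method you could log the Case Else result instead of silently adding undecoded bytes to the list.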

Additionally, within the /FlateDecode section you'll see this line:

Select Case Integer.Parse(bpp)

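This reads the image's BitsPerComponent entry, which tells the renderer how many bits are used for each color component of a pixel. Once again, you are the PDF renderer in this case, so it's up to you to figure out what to do. The code that you referenced only accounts for the values 1 (monochrome) and 24. Note that the PDF specification only allows 1, 2, 4, 8 and 16 for this entry, so a truecolor RGB image normally reports 8 (8 bits for each of three components, 24 bits per pixel) and would end up in the Case Else that throws. Other values should definitely be accounted for, especially 8.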
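
To make the bits-per-component handling more concrete, here is a hedged sketch, not a complete decoder. `ChoosePixelFormat` and `BitmapFromPackedRows` are hypothetical names, and the number of color components (1 for gray, 3 for RGB) is assumed to be known from the image's color space, which the original method never reads. The second function copies the data row by row, because GDI+ pads each bitmap row to a multiple of 4 bytes while PDF image rows are only padded to a whole byte.

```vbnet
Imports System
Imports System.Drawing
Imports System.Drawing.Imaging
Imports System.Runtime.InteropServices

Module PixelFormatSketch
    ' Hypothetical helper: maps bits per component and component count to a GDI+ format.
    Function ChoosePixelFormat(bitsPerComponent As Integer, componentCount As Integer) As PixelFormat
        If componentCount = 1 AndAlso bitsPerComponent = 1 Then Return PixelFormat.Format1bppIndexed
        If componentCount = 1 AndAlso bitsPerComponent = 8 Then Return PixelFormat.Format8bppIndexed
        If componentCount = 3 AndAlso bitsPerComponent = 8 Then Return PixelFormat.Format24bppRgb
        Throw New NotSupportedException(
            "Unsupported combination: " & bitsPerComponent.ToString() & " bits x " &
            componentCount.ToString() & " components")
    End Function

    ' Hypothetical helper: copies byte-aligned source rows into a bitmap with padded rows.
    Function BitmapFromPackedRows(data As Byte(), width As Integer, height As Integer,
                                  format As PixelFormat) As Bitmap
        Dim bitsPerPixel As Integer = Image.GetPixelFormatSize(format)
        ' Source rows are padded to a whole byte only
        Dim srcStride As Integer = (width * bitsPerPixel + 7) \ 8
        If data.Length < srcStride * height Then
            Throw New ArgumentException("Not enough sample data for the given dimensions")
        End If

        Dim bmp As New Bitmap(width, height, format)
        Dim bmd As BitmapData = bmp.LockBits(New Rectangle(0, 0, width, height),
                                             ImageLockMode.WriteOnly, format)
        Try
            For row As Integer = 0 To height - 1
                ' bmd.Stride is the padded row length inside the bitmap
                Marshal.Copy(data, row * srcStride, IntPtr.Add(bmd.Scan0, row * bmd.Stride), srcStride)
            Next
        Finally
            bmp.UnlockBits(bmd)
        End Try
        Return bmp
    End Function
End Module
```

Two caveats this sketch leaves open: Format24bppRgb stores each pixel as blue, green, red in memory, whereas PDF RGB samples come as red, green, blue, so the channels would still need swapping; and an 8 bpp indexed bitmap needs a grayscale palette assigned before its colors look right.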

So, summing this up: hopefully the code works for you as is, but don't be surprised if it complains a lot and/or misses images. Extracting images can actually be very frustrating at times. If you do run into problems, start a new question here referencing this one and hopefully we can help you more!

Chris Haas