
I understand that PDFKit allows extracting text and formatting as an NSAttributedString, but I can't find any information on extracting the individual figures from a PDF document using Swift.

Any help would be greatly appreciated, thanks!

Edit: https://stackoverflow.com/a/40788449/2303865 explains how to convert a whole page into an image. However, I need to extract all the images already embedded in a series of PDF documents, without knowing where they are located, so that solution does not answer my question.

Debee
  • Thanks @Damon for the edit suggestion! You can tell I'm a newbie here :) – Debee Nov 24 '18 at 14:05
  • https://stackoverflow.com/a/40788449/2303865 – Leo Dabus Nov 24 '18 at 14:45
  • Thanks @LeoDabus , however that is not the solution to my problem: I am not trying to convert the whole page into an image, but to extract programmatically images in a PDF so I can display/save them. – Debee Nov 24 '18 at 15:36
  • So you need to edit your question, show what you have tried and the issues you are facing otherwise I can't reopen your question. – Leo Dabus Nov 24 '18 at 15:41
  • this question should be closed again – Leo Dabus Nov 24 '18 at 15:46
  • possible duplicate of https://stackoverflow.com/a/2492305/2303865 – Leo Dabus Nov 24 '18 at 15:46
  • Thanks @LeoDabus - http://stackoverflow.com/a/2492305/2303865 solution is in Obj-C and the thread is closed - I would have preferred to add a comment there asking for a Swift 4 version of the solution, but I couldn't (need min 10 reputation). – Debee Nov 24 '18 at 15:58
  • https://objectivec2swift.com/#/converter/code/ – Leo Dabus Nov 24 '18 at 15:59
  • Amazing, thanks a lot @LeoDabus for this, and the extra patience with a newbie like me! – Debee Nov 24 '18 at 16:20
  • @Debee did you convert it to Swift, or find another solution? I was looking for a way to extract images from a PDF but couldn't find a working solution. I also tried to convert the code to Swift, but without any luck; I ran into a lot of issues. – Basel Jul 16 '19 at 18:33

1 Answer


Here is a Swift function that extracts images (more specifically, all objects with Subtype "Image") from PDF pages:

import PDFKit

func extractImages(from pdf: PDFDocument, extractor: @escaping (ImageInfo)->Void) throws {
    for pageNumber in 0..<pdf.pageCount {
        guard let page = pdf.page(at: pageNumber) else {
            throw PDFReadError.couldNotOpenPageNumber(pageNumber)
        }
        try extractImages(from: page, extractor: extractor)
    }
}

func extractImages(from page: PDFPage, extractor: @escaping (ImageInfo)->Void) throws {
    let pageNumber = page.label ?? "unknown page"
    guard let page = page.pageRef else {
        throw PDFReadError.couldNotOpenPage(pageNumber)
    }

    guard let dictionary = page.dictionary else {
        throw PDFReadError.couldNotOpenDictionaryOfPage(pageNumber)
    }

    guard let resources = dictionary[CGPDFDictionaryGetDictionary, "Resources"] else {
        throw PDFReadError.couldNotReadResources(pageNumber)
    }

    if let xObject = resources[CGPDFDictionaryGetDictionary, "XObject"] {
        print("reading resources of page", pageNumber)

        func extractImage(key: UnsafePointer<Int8>, object: CGPDFObjectRef, info: UnsafeMutableRawPointer?) -> Bool {
            guard let stream: CGPDFStreamRef = object[CGPDFObjectGetValue, .stream] else { return true }
            guard let dictionary = CGPDFStreamGetDictionary(stream) else {return true}

            guard dictionary.getName("Subtype", CGPDFDictionaryGetName) == "Image" else {return true}

            let colorSpaces = dictionary.getNameArray(for: "ColorSpace") ?? []
            let filter = dictionary.getNameArray(for: "Filter") ?? []

            var format = CGPDFDataFormat.raw
            guard let data = CGPDFStreamCopyData(stream, &format) as Data? else { return false }

            extractor(
              ImageInfo(
                name: String(cString: key),
                colorSpaces: colorSpaces,
                filter: filter,
                format: format,
                data: data
              )
            )

            return true
        }

        CGPDFDictionaryApplyBlock(xObject, extractImage, nil)
    }
}

struct ImageInfo: CustomDebugStringConvertible {
    let name: String
    let colorSpaces: [String]
    let filter: [String]
    let format: CGPDFDataFormat
    let data: Data

    var debugDescription: String {
        """
          Image "\(name)"
           - color spaces: \(colorSpaces)
           - format: \(format == .JPEG2000 ? "JPEG2000" : format == .jpegEncoded ? "jpeg" : "raw")
           - filters: \(filter)
           - size: \(ByteCountFormatter.string(fromByteCount: Int64(data.count), countStyle: .binary))
        """
    }
}

extension CGPDFObjectRef {
    func getName<K>(_ key: K, _ getter: (OpaquePointer, K, UnsafeMutablePointer<UnsafePointer<Int8>?>)->Bool) -> String? {
        guard let pointer = self[getter, key] else { return nil }
        return String(cString: pointer)
    }

    func getName<K>(_ key: K, _ getter: (OpaquePointer, K, UnsafeMutableRawPointer?)->Bool) -> String? {
        guard let pointer: UnsafePointer<UInt8> = self[getter, key] else { return nil }
        return String(cString: pointer)
    }

    subscript<R, K>(_ getter: (OpaquePointer, K, UnsafeMutablePointer<R?>)->Bool, _ key: K) -> R? {
        var result: R!
        guard getter(self, key, &result) else { return nil }
        return result
    }

    subscript<R, K>(_ getter: (OpaquePointer, K, UnsafeMutableRawPointer?)->Bool, _ key: K) -> R? {
        var result: R!
        guard getter(self, key, &result) else { return nil }
        return result
    }

    func getNameArray(for key: String) -> [String]? {
        var object: CGPDFObjectRef!
        guard CGPDFDictionaryGetObject(self, key, &object) else { return nil }

        if let name = object.getName(.name, CGPDFObjectGetValue) {
            return [name]
        } else {
            guard let array: CGPDFArrayRef = object[CGPDFObjectGetValue, .array] else {return nil}
            var names = [String]()
            for index in 0..<CGPDFArrayGetCount(array) {
                guard let name = array.getName(index, CGPDFArrayGetName) else { continue }
                names.append(name)
            }
            return names
        }
    }
}

enum PDFReadError: Error {
    case couldNotOpenPageNumber(Int)
    case couldNotOpenPage(String)
    case couldNotOpenDictionaryOfPage(String)
    case couldNotReadResources(String)
    case cannotReadXObjectStream(xObject: String, page: String)
}

You should know that images in PDFs can be represented in different ways. They can be embedded as self-contained JPGs, or they can be embedded as raw pixel data (losslessly compressed or not) with meta information about the compression, color space, width, height, and so forth.
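As a rough illustration of that distinction, the filter names collected by the function above can be used to guess how an image's bytes are stored. This is only a sketch and not part of the original code: `StoredImageKind` and `classify(filters:)` are hypothetical names, and the mapping assumes the standard filter names from the PDF specification (`DCTDecode` for JPEG, `JPXDecode` for JPEG 2000). The `format` value reported by `CGPDFStreamCopyData` should still be treated as the authoritative signal.

```swift
/// Hypothetical helper type: how an image stream's bytes are most likely stored.
enum StoredImageKind: Equatable {
    case jpeg            // DCTDecode: the stream is a complete JPEG file
    case jpeg2000        // JPXDecode: the stream is a complete JPEG 2000 file
    case rawPixels       // no filter, or only lossless/transport filters
    case other(String)   // e.g. CCITTFaxDecode, JBIG2Decode: needs its own decoder
}

/// Hypothetical helper: classify an image by the names in its Filter entry.
func classify(filters: [String]) -> StoredImageKind {
    // Filters are listed in decoding order, so the last one describes
    // the encoding of the image itself.
    guard let last = filters.last else { return .rawPixels }
    switch last {
    case "DCTDecode":
        return .jpeg
    case "JPXDecode":
        return .jpeg2000
    case "FlateDecode", "LZWDecode", "RunLengthDecode", "ASCIIHexDecode", "ASCII85Decode":
        return .rawPixels
    default:
        return .other(last)
    }
}
```

With the `ImageInfo` values produced above, `classify(filters: info.filter)` would presumably be enough to decide whether the data can be written straight to a `.jpg` file or needs further decoding.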

If you only want to export embedded JPGs, this code works just fine. But if you also want to visualise the raw images, you will need more parsing code. To get started, you can look at the PDF 2.0 spec (or an older free version of the spec), and this gist, which interprets JPGs in any color profile and raw images with any of the following color profiles:

  • DeviceGray
  • DeviceRGB
  • DeviceCMYK
  • Indexed
  • ICCBased
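
For raw images in these color spaces, a useful first sanity check is whether the decoded stream has the size implied by the image dictionary's `Width`, `Height` and `BitsPerComponent` entries. The sketch below is an illustration, not code from the gist: `componentCount(forColorSpace:iccComponents:)` and `expectedByteCount(width:height:bitsPerComponent:components:)` are hypothetical helpers, and the caller is assumed to have read those entries already (and, for `ICCBased`, the `N` entry of the profile stream).

```swift
/// Hypothetical helper: number of samples per pixel for the color spaces listed above.
/// For ICCBased the count comes from the profile stream's N entry (1, 3 or 4),
/// which the caller is assumed to have read already.
func componentCount(forColorSpace name: String, iccComponents: Int? = nil) -> Int? {
    switch name {
    case "DeviceGray": return 1
    case "DeviceRGB": return 3
    case "DeviceCMYK": return 4
    case "Indexed": return 1        // each sample is an index into a color table
    case "ICCBased": return iccComponents
    default: return nil             // color space not covered by this sketch
    }
}

/// Hypothetical helper: expected size of the decoded pixel data.
/// Each row of samples is padded to a whole number of bytes.
func expectedByteCount(width: Int, height: Int, bitsPerComponent: Int, components: Int) -> Int {
    let bitsPerRow = width * components * bitsPerComponent
    let bytesPerRow = (bitsPerRow + 7) / 8
    return bytesPerRow * height
}
```

For example, a 100 x 50 `DeviceRGB` image with 8 bits per component has 300 bytes per row, so `expectedByteCount(width: 100, height: 50, bitsPerComponent: 8, components: 3)` gives 15000. If `data.count` differs from that, the stream probably still needs decoding or uses a color space this sketch does not handle.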
Damiaan Dufaux