Extract a single page (or range of pages) from pdf data without loading the whole pdf (which takes too much RAM sometimes)

Question

Using PDFKit in Swift, you can use PDFDocument to open pdf files. That's easy and works well. But I'm building a custom pdf viewer (for comic book pdfs) that suits my needs, and I have one problem: a viewer doesn't need the whole pdf file in memory. I only need a few pages at a time.

Also, the pdfs consist only of images. There's no text or anything.

When instantiating a PDFDocument, the whole pdf data is being loaded into memory. If you have really huge pdf files (over 1GB) this isn't optimal (and can crash on some devices). As far as I know, there's no way in PDFKit to only load parts of a pdf document.

Is there anything I can do about that? I haven't found a Swift/Obj-C library that can do this (though I don't really know the right keywords to search for).

My workaround would be to preprocess pdfs and save each page as an image in the documents directory (or similar) using FileManager. That would result in a tremendous number of files but would solve the memory problem. I'm not sure I like this approach, though.
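
The preprocessing workaround described above could be sketched roughly as follows. This is only an illustration under assumptions: the `page-0007.jpg` naming scheme, the `bookID` folder and the function names are hypothetical, and the export step still opens the PDFDocument once, so it moves the memory cost to a one-off preprocessing pass rather than removing it. The PDFKit part is guarded so the pure helper compiles on its own.

```swift
import Foundation
#if canImport(PDFKit) && canImport(UIKit)
import PDFKit
import UIKit
#endif

/// Hypothetical naming scheme: "<baseDirectory>/<bookID>/page-0007.jpg".
func pageImageURL(baseDirectory: URL, bookID: String, pageIndex: Int) -> URL {
    let name = String(format: "page-%04d.jpg", pageIndex)
    return baseDirectory
        .appendingPathComponent(bookID, isDirectory: true)
        .appendingPathComponent(name)
}

#if canImport(PDFKit) && canImport(UIKit)
/// Renders every page of the pdf at `pdfURL` to a JPEG file and returns
/// the number of files written. The document is opened once here.
func exportPages(of pdfURL: URL, bookID: String, to baseDirectory: URL) throws -> Int {
    guard let document = PDFDocument(url: pdfURL) else { return 0 }
    let folder = baseDirectory.appendingPathComponent(bookID, isDirectory: true)
    try FileManager.default.createDirectory(at: folder, withIntermediateDirectories: true)

    var written = 0
    for index in 0..<document.pageCount {
        // Release temporary image buffers after each page.
        try autoreleasepool {
            guard let page = document.page(at: index) else { return }
            let size = page.bounds(for: .mediaBox).size
            let image = page.thumbnail(of: size, for: .mediaBox)
            guard let data = image.jpegData(compressionQuality: 0.9) else { return }
            try data.write(to: pageImageURL(baseDirectory: baseDirectory,
                                            bookID: bookID,
                                            pageIndex: index))
            written += 1
        }
    }
    return written
}
#endif
```

A viewer could then load only the images for the pages around the current one, using `pageImageURL` to locate them.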

Update:

So I did what @Prcela and @Sahil Manchanda proposed. It seems to be working for now.
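
For readers who want to try the splitting approach suggested in the comments, here is a minimal sketch. It assumes fixed-size chunks rather than real chapter boundaries (which the pdf may not expose), and the function names and the `part-N.pdf` file names are made up for illustration. The source document is still opened once during the split.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit
#endif

/// Splits `pageCount` pages into consecutive half-open ranges
/// of at most `chunkSize` pages each.
func chunkRanges(pageCount: Int, chunkSize: Int) -> [Range<Int>] {
    guard pageCount > 0, chunkSize > 0 else { return [] }
    return stride(from: 0, to: pageCount, by: chunkSize).map { start in
        start..<min(start + chunkSize, pageCount)
    }
}

#if canImport(PDFKit)
/// Writes one smaller pdf per chunk into `folder` and returns the new file URLs.
/// This is meant as a one-off preprocessing step.
func splitPDF(at sourceURL: URL, into folder: URL, chunkSize: Int) -> [URL] {
    guard let source = PDFDocument(url: sourceURL) else { return [] }
    var outputs: [URL] = []
    let ranges = chunkRanges(pageCount: source.pageCount, chunkSize: chunkSize)
    for (number, range) in ranges.enumerated() {
        let part = PDFDocument()
        for index in range {
            // Copy the page so it is not shared between two documents.
            guard let page = source.page(at: index)?.copy() as? PDFPage else { continue }
            part.insert(page, at: part.pageCount)
        }
        let target = folder.appendingPathComponent("part-\(number).pdf")
        if part.write(to: target) { outputs.append(target) }
    }
    return outputs
}
#endif
```

The viewer would then open only the small file that contains the current page, so each PDFDocument instance stays small.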

@yms: Hm, that could indeed be a problem. Does this even happen when the pdf contains only images and nothing else?

@Carpsen90: They are local (saved in the documents directory).

EDIT: I haven't accepted the answer below or given it the bounty. That happened automatically. It does not solve the problem. It still loads the entire PDF into memory!

Maybe this can help. Look at the simpler answer from Surani: https://stackoverflow.com/questions/50195842/how-to-implement-a-pdf-viewer-that-loads-pages-asynchronously?noredirect=1&lq=1 — Krešimir Prcela, Sep 13 '18 at 09:54
This is an interesting idea. I will look into it and see if that would be possible in my case. Thank you! — SwiftedMind, Sep 13 '18 at 10:03
"In a viewer, I don't need to have the whole pdf file in memory." Actually, unless the PDF is linearized, you do. Non-linearized PDF may have objects defined in page 100 that are needed in page 1, and all the objects of the file may also be compressed in a single container object. Linearized PDFs are designed to be loaded progressively. — yms, Sep 14 '18 at 10:54
@Quantm Instead of saving them as images, you could split the pdf into multiple small pdfs based on chapters. That way there will be fewer files, and you can use PDFKit effectively and efficiently — Sahil Manchanda, Sep 14 '18 at 11:55
"Does this even happen when there are only images?" for the shared objects part, it depends on the type of compression used, but in general yes, it may happen. Color-space definitions and color-palettes for example could be "optimized" and shared by several pages. On top of that, the whole-file-compression part is also very common and it is independent of the content of the pages. If all your files are generated by the same tool, you could post a sample file and I can take a look at the internal structure to tell you more about it. — yms, Sep 19 '18 at 10:53
Okay, then I think I have to stick with the way the guys above suggested. I have no influence on how the pdfs are generated. I don't generate them, unfortunately. — SwiftedMind, Sep 19 '18 at 11:00
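
As a side note to the linearization point raised in the comments above, a rough way to check whether a given file claims to be linearized is to look for the `/Linearized` key near the start of the file, since the linearization dictionary is expected within the first 1024 bytes. The sketch below is a heuristic only: the function names are made up, and a file that was modified after being linearized may still carry the key.

```swift
import Foundation

/// Returns true if the first 1024 bytes of `header` contain the "/Linearized" key.
func looksLinearized(_ header: Data) -> Bool {
    let marker = Data("/Linearized".utf8)
    return header.prefix(1024).range(of: marker) != nil
}

/// Reads only the beginning of the file, so even very large pdfs are cheap to check.
func isProbablyLinearized(fileAt url: URL) -> Bool {
    guard let handle = try? FileHandle(forReadingFrom: url) else { return false }
    defer { handle.closeFile() }
    return looksLinearized(handle.readData(ofLength: 1024))
}
```

Whether PDFKit actually loads a linearized file progressively is a separate question that this check does not answer.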

AD Progress · Answer 1 · 2019-07-09T15:29:58.517

I have an idea for how you could achieve this in PDFKit. According to the documentation, PDFDocument has a function that creates a selection spanning certain pages, which would probably solve your problem if you added the result to a collectionFlowView.

func selection(from startPage: PDFPage, atCharacterIndex startCharacter: Int, to endPage: PDFPage, atCharacterIndex endCharacter: Int) -> PDFSelection?

However, since you mainly have images, there is another function that lets you extract parts of the pdf based on CGPoints:

func selection(from startPage: PDFPage, at startPoint: CGPoint, to endPage: PDFPage, at endPoint: CGPoint) -> PDFSelection?

Also have a look at PDFView (https://developer.apple.com/documentation/pdfkit/pdfview), as this might be what you need if you only want to view the pages without editing annotations, etc.

I also prepared a little code to extract one page below. Hope it helps.

import PDFKit
import UIKit

class PDFViewController: UIViewController {

    override func viewDidLoad() {
        super.viewDidLoad()

        guard let url = Bundle.main.url(forResource: "myPDF", withExtension: "pdf") else {
            fatalError("INVALID URL")
        }
        let pdf = PDFDocument(url: url)
        let page = pdf?.page(at: 10) // zero-based index; returns an optional PDFPage
        // now you have one page extracted and you can play around with it.
    }
}

EDIT 1: Have a look at this code extract. I understand that the whole PDF gets loaded; however, this approach might be more memory-efficient, as iOS may handle it better in a PDFView:

func readBook() {
    if let oldBookView = self.view.viewWithTag(3) {
        // Remove the old book view when the user chooses a new book language
        oldBookView.removeFromSuperview()
    }

    if #available(iOS 11.0, *) {
        let pdfView = PDFView()
        let path = BookManager.getBookPath(bookLanguageCode: book.bookLanguageCode)
        let url = URL(fileURLWithPath: path)
        if let pdfDocument = PDFDocument(url: url) {
            pdfView.displayMode = .singlePageContinuous
            pdfView.autoScales = true
            pdfView.document = pdfDocument
            // The tag makes it easy to find and remove this view later,
            // when the user chooses a new book language
            pdfView.tag = 3
            let lastReadPage = getLastReadPage()

            if let page = pdfDocument.page(at: lastReadPage) {
                pdfView.go(to: page)
                // Subscribe to notifications so the last read page can be saved.
                // Subscribe only after displaying the last read page; otherwise
                // the first page will be displayed instead.
                NotificationCenter.default.addObserver(self, selector: #selector(self.saveLastReadPage), name: .PDFViewPageChanged, object: nil)
            }
        }

        self.containerView.addSubview(pdfView)
        setConstraints(view: pdfView)
        addTapGesture(view: pdfView)
    }
}

EDIT 2: This is not the answer the OP was looking for. It also loads the whole pdf into memory. See the comments.

`let pdf = PDFDocument(url: url)`. This would still load the entire pdf into memory, which is what I want to avoid. Grabbing pages from the document is not the problem. — SwiftedMind, Sep 15 '18 at 19:40
Have you tried opening the PDF in a web view to see if it gives you similar results? Here is a tutorial for it: https://pspdfkit.com/blog/2016/opening-a-pdf-in-swift/ — AD Progress, Sep 15 '18 at 19:44
Well, in my case a web view wouldn't work. Also, all of Apple's default pdf viewing classes that I've tried so far are literally terrible. A web view probably uses the same techniques to render the pdf and that's really bad. — SwiftedMind, Sep 15 '18 at 19:51
I would give CocoaPods or a GitHub search a shot; maybe that could help you. — AD Progress, Sep 15 '18 at 19:52
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/180125/discussion-between-ad-progress-and-quantm). — AD Progress, Sep 15 '18 at 19:55
No problem, just read it in a spare moment. I only left a few comments there. — AD Progress, Sep 15 '18 at 19:57
For everyone reading this question: This is not the correct answer. It still loads the entire PDF! My problem isn't loading the pdf or showing it. I need to avoid this: `if let pdfDocument = PDFDocument(url: url)` — SwiftedMind, Sep 22 '18 at 09:49
