Read contents of pdf as string

Question

How can I read the contents of a PDF as a string in Swift? I want to filter this string later and extract certain text elements from it. The PDF comes from a URL; I load it in a web view and cache it using an NSURL extension. How can I take this web view and read the contents of the URL? I tried:

var urlAsString = String(contentsOfURL: NSURL(string: "http://web.shschools.org/shpid/pdfs/WXS5N48Z.pdf")!, encoding: NSUTF8StringEncoding, error: nil)

However, that did not work, I assume because the file is a PDF. Can I get some help?

You will need to load the pdf as an NSData then parse the data somehow — David Skrundz, Aug 02 '15 at 03:24
... = NSData(contentsOfURL:) // https://developer.apple.com/library/prerelease/ios/documentation/Cocoa/Reference/Foundation/Classes/NSData_Class/index.html#//apple_ref/occ/instm/NSData/initWithContentsOfURL: — David Skrundz, Aug 02 '15 at 03:27
Down voters... Please see that the other question is in `objective-c` not SWIFT. Please upvote — shramee, Jul 06 '16 at 15:51
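To illustrate why the `String(contentsOfURL:...)` attempt fails and what the comments above are suggesting: a PDF is a binary container, so its bytes generally do not decode as UTF-8, and even when they do, the result is PDF syntax rather than the visible text. Here is a minimal sketch in current Swift, assuming you have already loaded the bytes (for example with `Data(contentsOf:)`, the modern spelling of `NSData(contentsOfURL:)`). The helper names `looksLikePDF` and `plainText(from:)` are made up for this example.

```swift
import Foundation

/// Checks for the "%PDF-" signature that PDF files normally begin with.
func looksLikePDF(_ data: Data) -> Bool {
    return data.starts(with: Array("%PDF-".utf8))
}

/// Hypothetical helper: decodes the bytes as UTF-8 text only when they are not a PDF.
/// For PDF bytes it returns nil, because those need a PDF parser or OCR instead.
func plainText(from data: Data) -> String? {
    if looksLikePDF(data) { return nil }
    return String(data: data, encoding: .utf8)
}
```

If `looksLikePDF` returns true for the downloaded data, hand the bytes to a PDF library or an OCR pipeline rather than trying to build a `String` from them directly.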

score 4 · Accepted Answer · answered Aug 02 '15 at 11:07

If you want to avoid a lot of programming, you will probably need a library that helps you extract text from PDFs.

You have two options:

1) Use an OCR library. Since a PDF can contain images as well as text, performing OCR to get the text is the most general solution. To run OCR on a PDF document, you need to convert it to a UIImage object. Another approach is to convert the contents of the web view to a UIImage, but this may produce a lower-resolution image, which can hurt OCR accuracy.

The downside of using an OCR library is that you will not get 100% accurate text, since an OCR engine always introduces some errors.

The best options for OCR are Tesseract for iOS (free, but with a higher error rate and a bit more complex to tune for good results) and BlinkOCR, a more robust option that is free to try and paid for commercial use, and whose engineers can give you a lot of help.

2) You can also use a PDF library. PDF libraries can reliably extract text written in the document, with the exception of text that is part of images inside the PDF. So, depending on the documents you want to read, this might be a better option (or not).

Some options for PDF libraries can be found here, and in our experience, PDFlib gives very good results and is the most customizable.
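As a rough sketch of the PDF-library route: this answer does not mention it, but on Apple platforms where PDFKit is available (iOS 11 and later, macOS), `PDFDocument` exposes the document's text through its `string` property. The function name `extractText(fromPDFData:)` is hypothetical, and the code assumes you already have the PDF bytes as `Data`. On platforms without PDFKit, the sketch simply returns nil.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit
#endif

/// Hypothetical helper: returns the text PDFKit can read from PDF bytes,
/// or nil if the bytes are not a readable PDF or PDFKit is unavailable.
func extractText(fromPDFData data: Data) -> String? {
    #if canImport(PDFKit)
    // PDFDocument(data:) is failable; it returns nil when the data is not a PDF.
    guard let document = PDFDocument(data: data) else { return nil }
    // `string` holds the text of the whole document; text inside images is not included.
    return document.string
    #else
    return nil
    #endif
}
```

As noted above, this only recovers text that is actually stored as text in the PDF. A scanned document will still need OCR.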

score 2 · Answer 2 · answered Aug 02 '15 at 03:31

A PDF can be a variety of things: it may display text but not actually contain any text that can be parsed (think of a fax-to-email service).

One idea would be to create an image context out of the web view then send it off to an OCR framework for character recognition. (Here's an OCR tutorial: http://www.raywenderlich.com/93276/implementing-tesseract-ocr-ios)

Fred Faust

How could you convert that web view to an image? – modesitt Aug 02 '15 at 03:37
You'll have to start with this answer: http://stackoverflow.com/a/20795651/4096655 - read the answers to both questions as you'll have some decisions to make. (Scrolling the entire view vs resizing it, etc.) – Fred Faust Aug 02 '15 at 03:43

score -1 · Answer 3 · answered Aug 02 '15 at 07:31

To extract an element from the text, you can use this function:

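import Foundation

// Returns the text between the first strFrom and the next strTo, or nil if either marker is missing.
func parser(_ textToParse: String, strFrom: String, strTo: String) -> String? {
    guard let start = textToParse.range(of: strFrom),
          let end = textToParse.range(of: strTo, range: start.upperBound..<textToParse.endIndex)
    else { return nil }
    return String(textToParse[start.upperBound..<end.lowerBound])
}

let s = parser("abc", strFrom: "a", strTo: "c")
// s will be Optional("b")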
