3

How can I get the text from a PDF page in Objective-C?

demon9733

2 Answers

5

First of all, give up on any "quick & dirty" solution for parsing PDF: it will fail miserably. My colleague spent a lot of time trying to solve this problem correctly on iOS. His top 3 options, in descending order of quality:

  1. MuPDF (http://www.mupdf.com/). A great library that handles text extraction well. It is licensed under the GPL, though, which is a showstopper for our proprietary application.
  2. A homemade solution based on CGPDFScanner. You can find a short description of how to do this here. The main problem with this approach is the SDK itself: Apple's PDF API is severely (and, I suspect, deliberately) limited. For example, you'll have to lay out the extracted text blocks in 2D space yourself, because PDF doesn't guarantee that the drawing order matches the text flow, and the iOS SDK offers no help here.
  3. Poppler (http://poppler.freedesktop.org/) is OK, but for text extraction it is roughly equivalent to the second option (with tons of additional dependencies).
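To make the "lay out extracted text blocks in 2D space" step of option 2 more concrete, here is a minimal sketch in plain C (which also compiles as Objective-C). It assumes you have already recorded a position for each string while scanning the page; that part is not shown. The names `TextFragment`, `sort_reading_order` and `join_fragments` are hypothetical and not part of any Apple API. The sketch also assumes a simple single-column page in the default PDF coordinate system (origin at the bottom-left, y growing upward), and the `tolerance` value would need tuning per document.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record for one piece of text captured while scanning a page. */
typedef struct {
    double x;         /* left edge, in PDF user space */
    double y;         /* baseline, in PDF user space (origin at bottom-left) */
    const char *text; /* decoded string for this fragment */
    int line;         /* line index, filled in by sort_reading_order */
} TextFragment;

/* Higher baselines first: the top of the page has the largest y. */
static int by_y_desc(const void *a, const void *b) {
    const TextFragment *fa = a, *fb = b;
    if (fa->y > fb->y) return -1;
    if (fa->y < fb->y) return 1;
    return 0;
}

/* Within one line, left to right. */
static int by_line_then_x(const void *a, const void *b) {
    const TextFragment *fa = a, *fb = b;
    if (fa->line != fb->line) return fa->line < fb->line ? -1 : 1;
    if (fa->x < fb->x) return -1;
    if (fa->x > fb->x) return 1;
    return 0;
}

/* Reorders fragments from drawing order into top-to-bottom, left-to-right
 * order. Fragments whose baselines lie within `tolerance` of the first
 * baseline of a line are treated as belonging to that line. */
void sort_reading_order(TextFragment *frags, size_t n, double tolerance) {
    if (n == 0) return;
    qsort(frags, n, sizeof *frags, by_y_desc);
    int line = 0;
    double line_y = frags[0].y;
    for (size_t i = 0; i < n; i++) {
        if (line_y - frags[i].y > tolerance) {
            line++;               /* baseline dropped enough: new line */
            line_y = frags[i].y;
        }
        frags[i].line = line;
    }
    qsort(frags, n, sizeof *frags, by_line_then_x);
}

/* Joins sorted fragments into `out`: a space between fragments on the same
 * line, a newline between lines. Stops early if `out` would overflow. */
void join_fragments(const TextFragment *frags, size_t n, char *out, size_t cap) {
    size_t used = 0;
    if (cap == 0) return;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        const char *sep = "";
        if (i > 0) sep = (frags[i].line != frags[i - 1].line) ? "\n" : " ";
        int w = snprintf(out + used, cap - used, "%s%s", sep, frags[i].text);
        if (w < 0 || (size_t)w >= cap - used) break;
        used += (size_t)w;
    }
}
```

Call `sort_reading_order` on the fragments collected for one page, then `join_fragments` to get a plain string. Grouping into lines before sorting by x avoids a fuzzy comparator inside `qsort`, which would not be a consistent ordering. Multi-column layouts, rotated text and word-spacing detection need more work than this.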

There may be more options on Mac OS X, but I don't know them.

hoha
2

Is this for iOS or OS X? If it is for OS X, you could simply create an Automator workflow to extract the text and call that workflow from your app. Automator has a PDF action, "Extract PDF Text", for exactly this purpose. The Automator framework lets you run Automator actions from your app. Some sample code can be found at http://rogueamoeba.com/utm/2005/06/03/ (note that the actual code has been updated to make use of the Automator framework).

VsSoft
  • Then, as mentioned below, you'll need to use a third-party library or develop your own. Besides the ones already mentioned, you might check out https://github.com/KurtCode/PDFKitten/ (built for searching, but it can pull out text as well) and https://github.com/mobfarm/FastPdfKit (free and paid versions available) – VsSoft Feb 25 '12 at 13:46