How can I get the text from a PDF page in Objective-C?
3
-
Duplicate question. See http://stackoverflow.com/questions/3287635/how-to-parse-pdf-in-objective-c-for-ipad – Avi Feb 24 '12 at 08:36
-
So where is the answer there? – demon9733 Feb 24 '12 at 08:38
-
@Avram that question has nothing to do with text extraction from PDF – hoha Feb 24 '12 at 08:39
-
I'm sorry, wrong link. See: http://stackoverflow.com/questions/2960195/extracting-pdf-text-in-objective-c – Avi Feb 24 '12 at 08:41
-
I see. Still, the "solution" presented there is crappy at best. It will not work for any nontrivial PDF. – hoha Feb 24 '12 at 08:53
2 Answers
5
First of all, give up on any "quick & dirty" solution for parsing PDF: it will fail miserably. My colleague spent a lot of time trying to solve this problem correctly on iOS. His top 3 options, in descending order of quality:
- MuPDF (http://www.mupdf.com/). A great library that handles text extraction well. It is licensed under the GPL, though, which is a show stopper for our proprietary application.
- A homemade solution based on CGPDFScanner. You can find a short description of how to do this here. The main problem with this approach is the SDK itself: Apple's PDF API is severely (and, I suspect, deliberately) limited. For example, you'll have to lay out the extracted text blocks in 2D space yourself, because PDF doesn't guarantee that the drawing order matches the text flow, and the iOS SDK offers no help here.
- Poppler (http://poppler.freedesktop.org/) is OK, but for text extraction it is roughly equivalent to the second option (with tons of additional dependencies).
There may be more options on Mac OS X, but I don't know them.
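To illustrate the 2D layout problem from the second option, here is a rough sketch in plain C (it compiles unchanged as Objective-C). It assumes your CGPDFScanner callbacks have already collected each text run together with its position. The `TextFragment` struct, the function names and the tolerance value are all made up for this example and are not part of any Apple API. The fragments are grouped into lines by baseline, then ordered left to right within each line:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record for one text run collected while scanning a page.
   Coordinates are in PDF user space, where y grows upward. */
typedef struct {
    double x;          /* left edge of the run */
    double y;          /* baseline of the run */
    const char *text;  /* decoded text of the run */
    int line;          /* line index, filled in by order_fragments() */
} TextFragment;

static int by_y_desc(const void *a, const void *b) {
    const TextFragment *fa = a, *fb = b;
    if (fa->y > fb->y) return -1;
    if (fa->y < fb->y) return 1;
    return 0;
}

static int by_line_then_x(const void *a, const void *b) {
    const TextFragment *fa = a, *fb = b;
    if (fa->line != fb->line) return fa->line < fb->line ? -1 : 1;
    if (fa->x < fb->x) return -1;
    if (fa->x > fb->x) return 1;
    return 0;
}

/* Reorders fragments into reading order: top to bottom, then left to right.
   Runs whose baselines lie within `tolerance` of the first run of a line
   are treated as belonging to that line. */
void order_fragments(TextFragment *frags, size_t count, double tolerance) {
    assert(frags != NULL || count == 0);
    if (count == 0) return;

    /* Pass 1: sort from the top of the page down. */
    qsort(frags, count, sizeof *frags, by_y_desc);

    /* Pass 2: start a new line whenever the baseline drops by more
       than the tolerance. */
    int line = 0;
    double line_y = frags[0].y;
    for (size_t i = 0; i < count; i++) {
        if (line_y - frags[i].y > tolerance) {
            line++;
            line_y = frags[i].y;
        }
        frags[i].line = line;
    }

    /* Pass 3: order by line, then by x within each line. */
    qsort(frags, count, sizeof *frags, by_line_then_x);
}

/* Joins ordered fragments into `out`: a space between runs on the same
   line, a newline between lines. Stops early if `out` is too small. */
void join_fragments(const TextFragment *frags, size_t count,
                    char *out, size_t out_size) {
    size_t used = 0;
    if (out_size == 0) return;
    out[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        const char *sep = "";
        if (i > 0) sep = (frags[i].line != frags[i - 1].line) ? "\n" : " ";
        int n = snprintf(out + used, out_size - used, "%s%s", sep, frags[i].text);
        if (n < 0 || (size_t)n >= out_size - used) break;  /* output truncated */
        used += (size_t)n;
    }
}
```

Call `order_fragments()` on the collected array with a tolerance of a couple of points, then `join_fragments()` to get a string. This only covers the simplest case. Multi-column pages, rotated text and runs that need a space inserted (or not) between them all need extra handling, which is a large part of why the homemade route is so much work.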
2
Is this for iOS or OS X? If it is for OS X, you could simply create an Automator workflow to extract the text and call that workflow from your app. Automator has a PDF action, "Extract PDF Text", for exactly this purpose, and the Automator framework allows calling Automator actions from your app. Some sample code can be found at http://rogueamoeba.com/utm/2005/06/03/ (note that the actual code has been updated to make use of the Automator framework).

VsSoft
-
Then, as mentioned below, you'll need to use a third-party library or develop your own. Besides the ones already mentioned, you might check out https://github.com/KurtCode/PDFKitten/ (searching capability, but can pull out text as well) and https://github.com/mobfarm/FastPdfKit (free version as well as paid versions available) – VsSoft Feb 25 '12 at 13:46