iOS PDF to plain text parser

Question

I'm quite at a loss on this subject. I've read pretty much every post about it here on SO, and I would very much appreciate it if somebody could nudge me in the right direction.

I have a PDF and I would like to extract its text; I'm only interested in words and spaces. I have set up a CGPDFScanner and its callback methods. From what I have read, I only need to consider four operators as far as extracting text goes: TJ, Tj, single quote (') and double quote (").

I guess I also need to keep track of the text space to be able to determine whether the letters should be put together to form a word or should be separated by a space, but I have no idea how to do this.

In the PDF, all text is in the format

[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ

but I have not been able to figure out (using the PDF specification) what these numbers mean. Somebody on SO said that you should not be scared of the PDF specs but frankly I do not find them very easy to read/understand.

I have studied the PDFKitten code which was helpful.

Any help would be greatly appreciated.

score 6 · Answer 1 · answered Sep 17 '12 at 18:39

I cannot give you advice on how to extract words from a PDF, but the format of

[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ

is explained for example in the PDF 1.7 Specification, section "9.4.3 Text-Showing Operators". The description of the TJ operator is:

Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space.

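So the numbers are adjustments to the distance between the letters, in thousandths of a text-space unit. The amount is subtracted from the current position, so in horizontal writing a negative value such as -24.2524 moves the next glyph slightly further to the right.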
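To make the arithmetic concrete, here is a minimal sketch in Python (chosen only to keep the logic language-neutral; it is not CGPDFScanner code). It assumes the TJ array has already been decoded into a list of strings and numbers. The names `tj_displacement` and `tj_to_text` and the 200-unit threshold are hypothetical: the threshold is a rough heuristic for spotting a word gap, and a real value would depend on the width of the font's space glyph.

```python
from typing import Sequence, Union

# One decoded element of a TJ array: a shown string or a numeric adjustment.
TJElement = Union[str, float, int]


def tj_displacement(adjustment: float, font_size: float,
                    horizontal_scale: float = 1.0) -> float:
    """Horizontal shift caused by one TJ number.

    The number is in thousandths of a text-space unit and is subtracted
    from the current position, so a negative adjustment gives a positive
    (rightward) shift in horizontal writing.
    """
    return -adjustment / 1000.0 * font_size * horizontal_scale


def tj_to_text(elements: Sequence[TJElement],
               space_threshold: float = 200.0) -> str:
    """Join the strings of a TJ array, inserting a space at large gaps.

    space_threshold is an assumed heuristic, not a value from the PDF
    specification: a gap of at least that many thousandths is treated
    as a word break.
    """
    parts = []
    for element in elements:
        if isinstance(element, str):
            parts.append(element)
        elif -element >= space_threshold and parts and not parts[-1].endswith(" "):
            # Large rightward gap after some text: treat it as a word break.
            parts.append(" ")
    return "".join(parts)
```

With the array from the question, every gap is about 24 thousandths, far below the assumed threshold, so `tj_to_text` would return the letters joined together as `XXXYY`. At a 12-point font size each of those adjustments shifts the next glyph by roughly 0.29 units, which looks more like letter spacing or kerning than a space between words.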
