3

I am trying to parse PDF content in order to search for and highlight text. Using the CGPDF functions, I can find text through the TJ and Tj operators and tell which page a word is on. The problem is the highlighting.

I followed several other posts, such as "Getting text position" and "Pdf search".

I know the operators for text positioning are Tm (text matrix), TD, and Td (and maybe T*), but I can't figure out how to use this information.

When I print the Tm value, I get a nine-digit integer, which I assume is a 3x3 matrix. Here is the output:

2011-03-23 10:59:07.894 PDFSearch[11035:40b] BT(I) 161361744:

2011-03-23 10:59:07.896 PDFSearch[11035:40b] TM(I) 161361104:

2011-03-23 10:59:07.897 PDFSearch[11035:40b] Tf(I) 161361616:

2011-03-23 10:59:07.899 PDFSearch[11035:40b] TJ: R

2011-03-23 10:59:07.899 PDFSearch[11035:40b] TJ: e

2011-03-23 10:59:07.901 PDFSearch[11035:40b] TJ: t

2011-03-23 10:59:07.901 PDFSearch[11035:40b] TJ: i

2011-03-23 10:59:07.903 PDFSearch[11035:40b] TJ: co

2011-03-23 10:59:07.903 PDFSearch[11035:40b] TJ: l

2011-03-23 10:59:07.905 PDFSearch[11035:40b] TJ: o

2011-03-23 10:59:07.907 PDFSearch[11035:40b] ET(I) 161361872:

Any idea how to use it to find the text position, and then use that to draw a box on the PDF view with Quartz 2D?

Thanks :)


2 Answers

3

The Tm operator has six parameters, so you need to call CGPDFScannerPopNumber six times. That gives you six float values that you can use to construct a CGAffineTransform. The e and f parameters correspond to tx and ty; the other fields (a, b, c, d) have the same names in both.

Refer to the PDF specification for more details, specifically the chapter about text (page 250 covers the Tm operator).
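For orientation, here is how the six Tm operands fit into the 3x3 matrix the question guessed at. The third column is fixed, which is why only six numbers appear in the content stream. The second expression shows how a point (x, y) in text space is mapped, using the row-vector convention of the PDF specification:

```latex
T_m = \begin{bmatrix} a & b & 0 \\ c & d & 0 \\ e & f & 1 \end{bmatrix},
\qquad
\begin{bmatrix} x' & y' & 1 \end{bmatrix}
= \begin{bmatrix} x & y & 1 \end{bmatrix} T_m
= \begin{bmatrix} a x + c y + e & b x + d y + f & 1 \end{bmatrix}
```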

Remember that the operands are popped from a stack, so f will be the first value that you get and a the last.
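The popping order is easy to get wrong, so here is a minimal sketch in plain C. It does not use Core Graphics: OperandStack, push_number, pop_number, pop_text_matrix, apply_matrix, and TextMatrix are hypothetical stand-ins invented for this illustration, with TextMatrix mirroring the field names of CGAffineTransform. It only shows the reverse-order pop and how the resulting matrix maps a point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for CGAffineTransform: same field names, plain doubles. */
typedef struct {
    double a, b, c, d, tx, ty;
} TextMatrix;

/* Hypothetical operand stack, mimicking how a scanner hands back operands. */
typedef struct {
    double values[16];
    size_t count;
} OperandStack;

void push_number(OperandStack *s, double v) {
    assert(s->count < sizeof s->values / sizeof s->values[0]);
    s->values[s->count++] = v;
}

/* Same shape as CGPDFScannerPopNumber: returns false if nothing is left. */
bool pop_number(OperandStack *s, double *out) {
    if (s->count == 0) return false;
    *out = s->values[--s->count];
    return true;
}

/* Pops the six Tm operands. The stream lists them as "a b c d e f Tm",
   so f comes off the stack first and a comes off last. */
bool pop_text_matrix(OperandStack *s, TextMatrix *m) {
    if (s->count < 6) return false; /* leave the stack untouched */
    return pop_number(s, &m->ty) && pop_number(s, &m->tx) &&
           pop_number(s, &m->d)  && pop_number(s, &m->c)  &&
           pop_number(s, &m->b)  && pop_number(s, &m->a);
}

/* Maps a text-space point: x' = a*x + c*y + tx, y' = b*x + d*y + ty. */
void apply_matrix(const TextMatrix *m, double x, double y,
                  double *out_x, double *out_y) {
    *out_x = m->a * x + m->c * y + m->tx;
    *out_y = m->b * x + m->d * y + m->ty;
}
```

In the real operator callback, each pop_number call would presumably become CGPDFScannerPopNumber(scanner, &value), and the six values would go to CGAffineTransformMake(a, b, c, d, tx, ty). Note that this only gives the text matrix at the point where Tm was set. A highlight rectangle also has to account for later Td, TD, and T* moves, the glyph widths of the text already shown, and the current transformation matrix of the page.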

omz
0

Check out PDFKitten, an open source project that parses the TJ, Tj, Tm, and other operators to calculate the text position on screen. It's not perfect, but it's a start. Searching in PDFs can be tricky: there are many ways to make a PDF display text, and some of them don't use fonts at all.

steipete