As a follow-up to my previous questions: I am trying to extract the text from a PDF file using the CGPDF* functions. Given a:

CGPDFStringRef pdfString

I figured out that it can be converted to an array of character codes like this:

const unsigned char *characterCodes = CGPDFStringGetBytePtr(pdfString);

Now, the text I'm trying to extract is written in one of the 14 standard Type 1 base fonts, which is not embedded in the PDF itself. Therefore, I have parsed the relevant AFM file for that font, giving me a mapping from character code to glyph name and its dimensions, like so:

C 61 ; WX 600 ; N equal ; B 80 138 520 376 ;
C 63 ; WX 600 ; N question ; B 129 -15 492 572 ;
C 64 ; WX 600 ; N at ; B 77 -15 533 622 ;
C 65 ; WX 600 ; N A ; B 3 0 597 562 ;
C 66 ; WX 600 ; N B ; B 43 0 559 562 ;
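For reference, a minimal sketch of how one such AFM line could be parsed in plain C. The struct and function names (`AFMCharMetric`, `afm_parse_char_metric`) are hypothetical, and the sketch assumes every line has exactly the field order shown above (C, WX, N, B) with integer values; real AFM files may contain extra fields, such as ligature entries, that this does not handle.

```c
#include <stdio.h>

/* One character-metrics record from an AFM file (hypothetical struct). */
typedef struct {
    int  code;               /* C:  character code, in decimal */
    int  width;              /* WX: advance width */
    char name[64];           /* N:  PostScript glyph name */
    int  llx, lly, urx, ury; /* B:  bounding box */
} AFMCharMetric;

/* Parses a line such as "C 61 ; WX 600 ; N equal ; B 80 138 520 376 ;".
   Returns 1 on success, 0 if the line does not match this assumed layout. */
static int afm_parse_char_metric(const char *line, AFMCharMetric *out)
{
    int n = sscanf(line, "C %d ; WX %d ; N %63s ; B %d %d %d %d",
                   &out->code, &out->width, out->name,
                   &out->llx, &out->lly, &out->urx, &out->ury);
    return n == 7;
}
```

Storing the parsed records in a 256-entry table indexed by `code` gives the code-to-glyph-name mapping described above.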

My question is: knowing the character code, say 61, how do I go from its glyph name, "equal", to the NSString @"="? This matters especially when that character code is remapped to another glyph name, for instance "question", by the PDF's font encoding option.

Previous questions: iOS PDF parsing Type 1 Fonts metrics and iOS PDF to plain text parser

DIJ

1 Answer

I have not tested this, but it seems to me that you need to use the Adobe Glyph Naming convention for this:

The purpose of the Adobe Glyph Naming convention is to support the computation of a Unicode character string from a sequence of glyphs. This is achieved by specifying a mapping from glyph names to character strings.

The glyphlist.txt linked on that page seems to be relevant for your issue.
Sample fragment:

...
epsilon;03B5
epsilontonos;03AD
equal;003D
equalmonospace;FF1D
equalsmall;FE66
equalsuperior;207C
...

Then all you need to do is put those Unicode values into your NSString instance.
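Like the rest of this answer, the following is an untested-on-device sketch in plain C: it reads one glyphlist.txt entry and encodes the resulting code point as UTF-8. The function names (`agl_parse_entry`, `utf8_encode`) are hypothetical. It assumes the `name;HEX` layout of the fragment above, skips `#` comment lines, and keeps only the first code point when an entry lists several.

```c
#include <stdio.h>
#include <string.h>

/* Parses one glyphlist.txt entry such as "equal;003D" and copies the glyph
   name into `name`. Returns the first code point of the entry, or 0 if the
   line is a comment or malformed. */
static unsigned int agl_parse_entry(const char *line, char *name, size_t name_size)
{
    const char *semi;
    unsigned int cp;
    size_t len;

    if (line[0] == '#') return 0;            /* comment line */
    semi = strchr(line, ';');
    if (semi == NULL) return 0;
    len = (size_t)(semi - line);
    if (len == 0 || len >= name_size) return 0;
    if (sscanf(semi + 1, "%x", &cp) != 1) return 0;
    memcpy(name, line, len);
    name[len] = '\0';
    return cp;
}

/* Encodes `cp` as a NUL-terminated UTF-8 string in `buf` (at least 5 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
static int utf8_encode(unsigned int cp, char *buf)
{
    int n;
    buf[0] = '\0';
    if (cp < 0x80) {
        buf[0] = (char)cp;
        n = 1;
    } else if (cp < 0x800) {
        buf[0] = (char)(0xC0 | (cp >> 6));
        buf[1] = (char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */
        buf[0] = (char)(0xE0 | (cp >> 12));
        buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char)(0x80 | (cp & 0x3F));
        n = 3;
    } else if (cp <= 0x10FFFF) {
        buf[0] = (char)(0xF0 | (cp >> 18));
        buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char)(0x80 | (cp & 0x3F));
        n = 4;
    } else {
        return 0;
    }
    buf[n] = '\0';
    return n;
}
```

On iOS, the resulting UTF-8 buffer can be handed to `[NSString stringWithUTF8String:buf]` to obtain the NSString.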

Edit:
Confirming the information above, I found the following explanation in the PDF Reference from Adobe, Section 5.9, Extraction of Text Content:

If the font is a simple font that uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font (see Appendix D):

  1. Map the character code to a character name according to Table D.1 on page 996 and the font’s Differences array.
  2. Look up the character name in the Adobe Glyph List (see the Bibliography) to obtain the corresponding Unicode value.
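The two steps quoted above could be sketched in plain C roughly as follows. All names here (`Encoding`, `Difference`, `apply_differences`, `unicode_for_code`) are hypothetical, the glyph table is only a tiny excerpt of the Adobe Glyph List, and the Differences array is assumed to have already been flattened into explicit (code, name) pairs.

```c
#include <stddef.h>
#include <string.h>

/* Code-to-glyph-name table for a simple font (NULL means unmapped). */
typedef struct { const char *names[256]; } Encoding;

/* One flattened override from the font's Differences array. */
typedef struct { int code; const char *name; } Difference;

/* Glyph name to Unicode; a tiny excerpt of the Adobe Glyph List. */
typedef struct { const char *name; unsigned int unicode; } GlyphEntry;

static const GlyphEntry kGlyphList[] = {
    { "equal", 0x003D }, { "question", 0x003F }, { "at", 0x0040 },
    { "A", 0x0041 },     { "B", 0x0042 }
};

/* Step 1 (second half): overlay the Differences on the base encoding. */
static void apply_differences(Encoding *enc, const Difference *diffs, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (diffs[i].code >= 0 && diffs[i].code < 256)
            enc->names[diffs[i].code] = diffs[i].name;
    }
}

/* Step 2: look the glyph name up in the glyph list. Returns 0 if unknown. */
static unsigned int unicode_for_glyph_name(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof kGlyphList / sizeof kGlyphList[0]; i++) {
        if (strcmp(kGlyphList[i].name, name) == 0)
            return kGlyphList[i].unicode;
    }
    return 0;
}

/* Character code -> glyph name -> Unicode. Returns 0 if the code is unmapped. */
static unsigned int unicode_for_code(const Encoding *enc, unsigned char code)
{
    const char *name = enc->names[code];
    return name != NULL ? unicode_for_glyph_name(name) : 0;
}
```

With the question's example, code 61 resolves to "equal" and U+003D in the base table; after applying a difference that remaps 61 to "question", the same code resolves to U+003F. A linear search is used only for brevity; with the full glyph list a hash table or a sorted array with binary search would be more appropriate.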
yms
  • Thank you very much, I'm pressed for time at the moment on another project. Once I have time to verify your answer I will accept it. – DIJ Oct 17 '12 at 12:45
  • Thank you yms, I've actually read that outline of the PDF reference as well, it did not stick / make sense to me at that time. Thanks a lot! – DIJ Oct 17 '12 at 17:17