Do PDF files generally use "correct" character codes for font glyphs?

Question

Say I have a PDF file that contains one or more embedded fonts. Here's my understanding of how a single character of text is rendered:

1. First, determine which font the character uses.
2. Use the font's "cmap," embedded in the PDF, to determine the font's glyph name for the given character. For example, the character '&' in PDF text might map to a glyph that the font internally calls 'ampersand'.
3. Use the font's "glyf" table to determine the bounding box / drawing instructions for the glyph name.

Here's my question: is a PDF cmap generally consistent? Put another way, if I encounter the character "&" in a PDF, can I be assured that the cmap will always map "&" to the ampersand glyph? Or does some PDF-generation software create its own arbitrary mapping between character codes and glyph names (which would be rather evil and possibly break in-PDF searching and text selection)?

Of course I realize it's possible for the cmap to use an unintuitive mapping -- I guess I'm asking, does this actually happen in the Real World?

My specific use-case is in the world of music fonts. I'm analyzing characters in a PDF to determine which music glyph each one represents (e.g., treble clef, notehead, etc.). I want to know how confident I can be that the combination of font name and character code will always result in the same glyph. For example, if I know the font name is "Opus" and the glyph is "#", can I assume that will always be mapped to the treble clef glyph? Or do I have to analyze the glyph's metrics to make sure it's actually a treble clef?

score 3 · Accepted Answer · answered Aug 06 '14 at 06:44

It differs from one PDF creator to another.

A fairly common method (alas!) is "order encountered", where the first character in the text stream gets mapped to \001, the next to \002 and so on. So the text "Hello" would be encoded as \001\002\003\003\004.
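To make the "order encountered" scheme concrete, here is a minimal Python sketch. The function name `encode_order_encountered` is hypothetical and does not reproduce any particular PDF producer; it only illustrates why the resulting codes depend on the text being encoded rather than on the character itself.

```python
def encode_order_encountered(text):
    """Assign codes 1, 2, 3, ... to characters in order of first appearance.

    Illustrative only: real PDF producers differ in the details.
    """
    code_for_char = {}
    codes = []
    for ch in text:
        if ch not in code_for_char:
            # First time this character is seen: give it the next free code.
            code_for_char[ch] = len(code_for_char) + 1
        codes.append(code_for_char[ch])
    return codes, code_for_char


hello_codes, hello_map = encode_order_encountered("Hello")
world_codes, world_map = encode_order_encountered("World")

print(hello_codes)  # [1, 2, 3, 3, 4]
print(world_codes)  # [1, 2, 3, 4, 5]
# The same character gets a different code in each subset:
print(hello_map["o"], world_map["o"])  # 4 2
```

Under this scheme the letter "o" is code 4 in one subset and code 2 in the other, so a character code on its own says nothing about which glyph is drawn.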

> I want to know how confident I can be that the combination of font name and character code will always result in the same glyph.

In a single PDF document, if the same font object is used in different contexts, it will be true -- the mapping is defined inside the font object. If you encounter another font object that uses the same font but it points to another font stream (i.e., the font subset is embedded twice), then it may not be true. Each subset may have an encoding of its own.
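As a rough sketch of what this means for a lookup table: key it by font object, not by font name. Everything in the example below is made up for illustration, including the object numbers (12 and 27), the glyph labels, and the helper names. The only PDF convention it relies on is that subset fonts carry a six-uppercase-letter tag followed by `+` in front of the font name.

```python
# Hypothetical data: two font objects in one PDF, both subsets of "Opus",
# each with its own code assignments. Object numbers and glyph labels
# are invented for this example.
font_objects = {
    12: {"base_font": "ABCDEF+Opus", "codes": {1: "treble clef", 2: "notehead"}},
    27: {"base_font": "GHIJKL+Opus", "codes": {1: "notehead", 2: "treble clef"}},
}


def family_name(base_font):
    """Strip a subset tag such as 'ABCDEF+' from a font name, if present."""
    tag, sep, rest = base_font.partition("+")
    if sep and len(tag) == 6 and tag.isalpha() and tag.isupper():
        return rest
    return base_font


def glyph_for(font_obj_id, code):
    """Look up a glyph by (font object, code), never by (font name, code)."""
    return font_objects[font_obj_id]["codes"].get(code)


# Both objects report the same family name...
print(family_name(font_objects[12]["base_font"]))  # Opus
print(family_name(font_objects[27]["base_font"]))  # Opus
# ...but the same code selects different glyphs in each.
print(glyph_for(12, 1))  # treble clef
print(glyph_for(27, 1))  # notehead
```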

Only if the font object contains a /ToUnicode mapping can you be confident that the codes map to the correct characters.
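For illustration, a /ToUnicode stream is a small CMap program whose `beginbfchar` ... `endbfchar` sections pair a character code with a UTF-16BE value, both written as hex strings. The following is a simplified sketch, not a complete CMap parser: it ignores `beginbfrange` sections, assumes single-byte codes, and uses a hand-written sample fragment. The names `parse_bfchar` and `decode` are hypothetical.

```python
import re

# Hand-written sample in the style of a ToUnicode CMap (not from a real file).
SAMPLE_CMAP = """
1 begincodespacerange
<00> <FF>
endcodespacerange
4 beginbfchar
<01> <0048>
<02> <0065>
<03> <006C>
<04> <006F>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Collect code -> Unicode string pairs from bfchar sections only."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # The destination is hex-encoded UTF-16BE.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode(codes, mapping):
    """Decode single-byte codes; unknown codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


to_unicode = parse_bfchar(SAMPLE_CMAP)
print(decode(b"\x01\x02\x03\x03\x04", to_unicode))  # Hello
```

With such a table, the "order encountered" codes \001\002\003\003\004 decode back to "Hello"; without it, the codes are opaque.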

Jongware

+1; *Only if the font object contains a /ToUnicode mapping, you can be confident that values map to the correct characters.* - You can be very confident but not 100% sure - there are PDFs which explicitly include false information in the **ToUnicode** map to prevent text extraction. – mkl Aug 06 '14 at 07:59
@mkl: ouch :) That's a new one. Is this a "feature" in any particular software you know? – Jongware Aug 06 '14 at 08:20
Have a look [here](http://stackoverflow.com/a/22688775/1729265): The software replaces the **ToUnicode** entry for one code with something wrong. Thus, all text extractors relying on **ToUnicode** alone extract something wrong. To make copy&paste from Adobe Reader work, though, it adds an **ActualText** structure element entry wherever that code is used indicating the correct Unicode code. – mkl Aug 06 '14 at 08:30
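A rough sketch of the consequence for a text extractor: where a run of codes carries an ActualText replacement, prefer it over the per-code ToUnicode lookup. The data layout here (a list of `(codes, actual_text)` pairs) and the function names are assumptions made for the example. A real extractor would have to read ActualText from the PDF's marked-content or structure information.

```python
# Hypothetical ToUnicode table in which code 3 has been deliberately
# falsified ("X" instead of "l").
to_unicode = {1: "H", 2: "e", 3: "X", 4: "o"}

# Hypothetical text runs: (codes, actual_text). actual_text is None when
# the run has no ActualText replacement.
runs = [
    ([1, 2], None),
    ([3], "l"),
    ([3], "l"),
    ([4], None),
]


def extract_to_unicode_only(runs, to_unicode):
    """Naive extraction that trusts ToUnicode for every code."""
    return "".join(
        to_unicode.get(code, "\ufffd") for codes, _ in runs for code in codes
    )


def extract_preferring_actual_text(runs, to_unicode):
    """Use ActualText where present, otherwise fall back to ToUnicode."""
    parts = []
    for codes, actual_text in runs:
        if actual_text is not None:
            parts.append(actual_text)
        else:
            parts.append("".join(to_unicode.get(c, "\ufffd") for c in codes))
    return "".join(parts)


print(extract_to_unicode_only(runs, to_unicode))         # HeXXo
print(extract_preferring_actual_text(runs, to_unicode))  # Hello
```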
