Say I have a PDF file that contains one or more embedded fonts. Here's my understanding of how a single character of text is rendered:
- First, determine which font the character uses.
- Use the font's "cmap," embedded in the PDF, to determine the font's glyph name for the given character. For example, the character '&' in PDF text might map to a glyph that the font internally calls 'ampersand'.
- Use the font's "glyf" table to determine the bounding box / drawing instructions for the glyph name.
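To make that concrete, the steps above can be sketched roughly as follows. This is only a toy model of my understanding, not real font parsing: `TOY_CMAP`, `TOY_GLYF`, `lookup_glyph`, and all the values in them are made up for illustration.

```python
# Toy stand-ins for the font tables. Every name and value here is
# hypothetical and only mirrors the steps described above.
TOY_CMAP = {0x26: "ampersand"}  # character code -> glyph name (my assumption)
TOY_GLYF = {"ampersand": {"bbox": (10, -5, 600, 700)}}  # glyph name -> outline data


def lookup_glyph(char, cmap=TOY_CMAP, glyf=TOY_GLYF):
    """Follow the cmap -> glyf chain for a single character.

    Returns (glyph_name, bbox), or None if the cmap has no entry.
    """
    glyph_name = cmap.get(ord(char))
    if glyph_name is None:
        return None
    return glyph_name, glyf[glyph_name]["bbox"]
```

With these toy tables, `lookup_glyph("&")` follows the chain to the 'ampersand' entry, while a character missing from the cmap yields `None`.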
Here's my question: is a PDF cmap generally consistent? Put another way, if I encounter the character "&" in a PDF, can I be assured that the cmap will always map "&" to the ampersand glyph? Or does some PDF-generation software create its own arbitrary mapping between character codes and glyph names (which would be rather evil and possibly break in-PDF searching and text selection)?
Of course I realize it's possible for the cmap to use an unintuitive mapping -- I guess I'm asking, does this actually happen in the Real World?
My specific use case is music fonts. I'm analyzing characters in a PDF to determine which music glyph each one represents (e.g., treble clef, notehead, etc.). I want to know how confident I can be that a given combination of font name and character code will always resolve to the same glyph. For example, if I know the font name is "Opus" and the character is "#", can I assume it will always be mapped to the treble clef glyph? Or do I have to analyze the glyph's metrics to make sure it's actually a treble clef?
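For reference, here is a minimal sketch of the two strategies I'm weighing: trusting the (font name, character) pair outright, versus also checking the glyph's bounding box. The table contents, the bounding-box numbers, the `identify` function, and the tolerance are all hypothetical placeholders, not values taken from the real Opus font.

```python
# Hypothetical lookup table: (font name, character) -> (symbol, expected bbox).
# The bbox numbers are invented for illustration only.
KNOWN_GLYPHS = {
    ("Opus", "#"): ("treble clef", (0, -600, 700, 1500)),
}


def identify(font_name, char, bbox=None, tolerance=0.05):
    """Guess the music symbol for a character in a given font.

    If bbox is None, trust the (font name, character) pair alone.
    Otherwise, also require every bbox coordinate to be within
    `tolerance` (as a fraction of the glyph's larger dimension)
    of the expected value.
    """
    entry = KNOWN_GLYPHS.get((font_name, char))
    if entry is None:
        return None
    symbol, expected = entry
    if bbox is None:
        return symbol
    width = expected[2] - expected[0]
    height = expected[3] - expected[1]
    allowed = tolerance * max(width, height)
    matches = all(abs(a - b) <= allowed for a, b in zip(bbox, expected))
    return symbol if matches else None
```

Calling `identify("Opus", "#")` is the "trust the mapping" approach; passing a measured `bbox` as well is the "verify the metrics" approach I'd like to avoid if the mapping is reliable.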