FontTools: extracting useful UTF information provided by it

Question

FontTools is producing some XML with all sorts of details, in this structure:

  <cmap>
    <tableVersion version="0"/>
    <cmap_format_4 platformID="0" platEncID="3" language="0">
      <map code="0x20" name="space"/><!-- SPACE -->
         <!--many, many more characters-->
    </cmap_format_4>
    <cmap_format_0 platformID="1" platEncID="0" language="0">
      <map code="0x0" name=".notdef"/>
         <!--many, many more characters again-->
    </cmap_format_0>
    <cmap_format_4 platformID="0" platEncID="3" language="0"> <!--"cmap_format_4" again-->
      <map code="0x20" name="space"/><!-- SPACE -->
         <!--more "map" nodes-->
    </cmap_format_4>
 </cmap>

I'm trying to figure out every character this font supports, so these code attributes are what I'm interested in. I believe all code attributes are UTF-8 values: is this correct? I am also curious why there are two cmap_format_4 nodes. They seem to be identical, but I haven't tested that with a thorough number of fonts, so if someone familiar with this module knows for certain, that is my first question.

To be sure I am seeing all characters contained in the typeface, do I need to combine all code attribute values, or just one or two? Will FontTools always produce these three XML nodes, or is the quantity variable? Any idea why? The documentation is a little vague.

djangodude · Answer 1 · 2015-04-03T15:37:05.237

The number of cmap_format_N nodes ("cmap subtables") is variable, as is the 'N' (the format). There are several formats; the most common is 4, but there are also formats 0, 6, 12, and a few others.
Fonts may have multiple cmap subtables, but are not required to. The reason is the history of TrueType (which has evolved into OpenType). The format was invented before Unicode, at a time when each platform had its own way(s) of character mapping. The different formats and the ability to have multiple mappings were a necessity at the time, so that a single font file could map everything without multiple files, duplication, etc. Nowadays most fonts that are produced have only a single Unicode subtable, but there are many floating around that have multiple subtables.
The code values in the map nodes are code point values expressed as hexadecimal. They might be Unicode values, but not necessarily (see the next point).
I think your font may be corrupted (or possibly there was a copy/paste mix-up). It is possible to have multiple cmap_format_N entries in the cmap, but each combination of platformID/platEncID/language should be unique. Also, it is important to note that not all cmap subtables map Unicode values; some express older, pre-Unicode encodings. You should look at tables where platformID="3" first, then platformID="0", and finally platformID="2" as a last resort. Other platformIDs do not necessarily map Unicode values.
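
One way to check the uniqueness point above is to count the platformID/platEncID/language combinations in the XML. The following is a minimal sketch using only Python's standard library on a trimmed copy of the XML from the question; the names CMAP_XML and duplicate_subtable_keys are made up for illustration and are not part of FontTools.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed copy of the structure from the question (illustrative data only).
CMAP_XML = """
<cmap>
  <tableVersion version="0"/>
  <cmap_format_4 platformID="0" platEncID="3" language="0">
    <map code="0x20" name="space"/>
  </cmap_format_4>
  <cmap_format_0 platformID="1" platEncID="0" language="0">
    <map code="0x0" name=".notdef"/>
  </cmap_format_0>
  <cmap_format_4 platformID="0" platEncID="3" language="0">
    <map code="0x20" name="space"/>
  </cmap_format_4>
</cmap>
"""


def duplicate_subtable_keys(xml_text):
    """Return the (platformID, platEncID, language) keys that occur more than once."""
    root = ET.fromstring(xml_text)
    # Accept either a bare <cmap> element or a document that contains one as a child.
    cmap = root if root.tag == "cmap" else root.find("cmap")
    counts = Counter(
        (sub.get("platformID"), sub.get("platEncID"), sub.get("language"))
        for sub in cmap
        if sub.tag.startswith("cmap_format_")
    )
    return [key for key, count in counts.items() if count > 1]


duplicates = duplicate_subtable_keys(CMAP_XML)
print(duplicates)
```

A non-empty result, as with the sample from the question, suggests the kind of corruption or copy/paste mix-up mentioned above.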

As for discovering "all Unicodes mapped in a font": that can be a bit tricky when there are multiple Unicode subtables, especially if their contents differ. You might get close by taking the union of all code values in all of the subtables that are known to be Unicode maps, but it is important to understand that most platforms will only use one of the maps at a time. Usually there is a preferred picking order similar to what I stated above; when one is found, that is the one used. There's no standardized order of preference that applies to all platforms (that I'm aware of), but most of the popular ones follow an order pretty close to what I listed.
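
The "union of all Unicode subtables" idea can be sketched roughly as follows, again with the standard library only. This assumes that subtables with platformID 3 or 0 are the Unicode ones, which follows the order given above but is a simplification: it does not inspect platEncID, so a Windows symbol-encoded subtable would be included too. SAMPLE_XML and unicode_code_points are hypothetical names, and the sample data is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample: two Unicode subtables with different contents, one legacy Mac subtable.
SAMPLE_XML = """
<cmap>
  <tableVersion version="0"/>
  <cmap_format_4 platformID="0" platEncID="3" language="0">
    <map code="0x20" name="space"/>
    <map code="0x41" name="A"/>
  </cmap_format_4>
  <cmap_format_0 platformID="1" platEncID="0" language="0">
    <map code="0x0" name=".notdef"/>
  </cmap_format_0>
  <cmap_format_4 platformID="3" platEncID="1" language="0">
    <map code="0x20" name="space"/>
    <map code="0xe9" name="eacute"/>
  </cmap_format_4>
</cmap>
"""


def unicode_code_points(xml_text, unicode_platforms=(3, 0)):
    """Union of code values from subtables whose platformID is assumed to be Unicode."""
    root = ET.fromstring(xml_text)
    cmap = root if root.tag == "cmap" else root.find("cmap")
    found = set()
    for subtable in cmap:
        if not subtable.tag.startswith("cmap_format_"):
            continue  # skip <tableVersion> and anything else that is not a subtable
        if int(subtable.get("platformID")) not in unicode_platforms:
            continue  # e.g. platformID 1 (legacy Macintosh encodings)
        for entry in subtable.iter("map"):
            # int() with base 16 accepts the "0x" prefix used in the code attribute
            found.add(int(entry.get("code"), 16))
    return found


code_points = unicode_code_points(SAMPLE_XML)
print(sorted(f"U+{cp:04X}" for cp in code_points))
```

Keep in mind that this union may be larger than what any single platform actually sees, since a platform typically picks one subtable and ignores the rest.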

Finally, regarding Unicode vs UTF-8: the code values are Unicode code points, NOT UTF-8 byte sequences. If you're not sure of the difference, spend some time reading about character encodings and byte serialization at Unicode.org.
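
To make the difference concrete, here is a small Python sketch that takes a few code attribute values in the style of the XML above, treats each as a Unicode code point, and shows the UTF-8 bytes it would serialize to. The helper name describe is made up for the example, and the values assume the subtable really is a Unicode one.

```python
def describe(code_attr):
    """Turn a hex code attribute such as "0xe9" into (code point, character, UTF-8 bytes)."""
    code_point = int(code_attr, 16)  # the number stored in the cmap
    char = chr(code_point)           # the character that code point denotes
    return code_point, char, char.encode("utf-8")


for code_attr in ("0x20", "0xe9", "0x20ac"):
    code_point, char, utf8 = describe(code_attr)
    print(f"U+{code_point:04X} {char!r} -> UTF-8 bytes: {utf8.hex(' ')}")
```

For 0x20 the code point and the single UTF-8 byte happen to coincide, which is probably why the two are easy to confuse; for 0xe9 the UTF-8 form is the two bytes C3 A9, and for 0x20ac it is three bytes.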

Thank you for a detailed response. What determines which of the subtables a particular platform will use? Will it always be the same for a particular platform? Why is there more than one subtable in the first place? — 1252748, Apr 03 '15 at 15:23
I've updated my answer with some additional information to answer your questions. — djangodude, Apr 03 '15 at 15:37
