84

How do I extract the list of supported Unicode characters from a TrueType or embedded OpenType font on Linux?

Is there a tool or a library I can use to process a .ttf or a .eot file and build a list of code points (like U+0123, U+1234, etc.) provided by the font?

Neil Mayhew
Till Ulen
  • 6
    Try `fc-list :charset=1234`, but double-check its output… (it does work for me, it shows Gentium as having 2082 but not 2161) – mirabilos Mar 27 '19 at 22:43
  • 1
    @mirabilos This isn't what the question asked. It shows the fonts that contain a given character (ie 1234). – Neil Mayhew Mar 28 '19 at 17:36
  • Oh right. But these two questions are interwoven (and you’ll find many answers to the wrong question in the Answers section). – mirabilos Mar 29 '19 at 00:54
  • @mirabilos Good point. I've edited the title slightly to make the intent of the question more obvious. – Neil Mayhew Mar 30 '19 at 15:04
  • Same question on UNIX.SE: [fonts - How to find out which unicode codepoints are defined in a TTF file? - Unix & Linux Stack Exchange](https://unix.stackexchange.com/q/247108/296692) -- include an answer using `otfinfo`. – user202729 Feb 16 '23 at 03:15

14 Answers

55

Here is a method using the fontTools Python library (which you can install with something like pip install fonttools):

#!/usr/bin/env python
from itertools import chain
import sys

from fontTools.ttLib import TTFont
from fontTools.unicode import Unicode

with TTFont(
    sys.argv[1], 0, allowVID=0, ignoreDecompileErrors=True, fontNumber=-1
) as ttf:
    chars = chain.from_iterable(
        [y + (Unicode[y[0]],) for y in x.cmap.items()] for x in ttf["cmap"].tables
    )
    if len(sys.argv) == 2:  # print all code points
        for c in chars:
            print(c)
    elif len(sys.argv) >= 3:  # search code points / characters
        code_points = {c[0] for c in chars}
        for i in sys.argv[2:]:
            code_point = int(i)   # search code point
            #code_point = ord(i)  # search character
            print(Unicode[code_point])
            print(code_point in code_points)

The script takes as arguments the font path and optionally code points / characters to search for:

$ python checkfont.py /usr/share/fonts/**/DejaVuSans.ttf
(32, 'space', 'SPACE')
(33, 'exclam', 'EXCLAMATION MARK')
(34, 'quotedbl', 'QUOTATION MARK')
…

$ python checkfont.py /usr/share/fonts/**/DejaVuSans.ttf 65 12622  # A ㅎ
LATIN CAPITAL LETTER A
True
HANGUL LETTER HIEUH
False
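
As the comments on this answer note, people often want to search by something other than a decimal number. The following is a minimal sketch of one way to accept several spellings of a code point on the command line; `parse_code_point` is a hypothetical helper, not part of fontTools, and its rules (for instance, treating a single non-digit character as a literal) are assumptions made for illustration:

```python
def parse_code_point(arg):
    """Turn 'U+314E', '0x314e', '12622' or a literal character into an integer."""
    # Assumption: a single non-digit character is meant literally, e.g. 'A'.
    if len(arg) == 1 and not arg.isdigit():
        return ord(arg)
    # 'U+XXXX' notation is always hexadecimal.
    if arg.upper().startswith("U+"):
        return int(arg[2:], 16)
    # Base 0 accepts plain decimal and 0x-prefixed hexadecimal.
    return int(arg, 0)


if __name__ == "__main__":
    for text in ["65", "0x41", "U+0041", "A"]:
        print(text, "->", parse_code_point(text))
```

In the script above, this could stand in for the `int(i)` call. Note that a lone digit such as `5` is still read as the number 5, not as the character "5".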
Cristian Ciupitu
Janus Troelsen
  • 1
    `int(sys.argv[2], 0)` will probably fail with "invalid literal" in most case, since one probably wants to find special characters. Use `ord(sys.argv[2].decode('string_escape').decode('utf-8'))` instead. – Skippy le Grand Gourou Feb 07 '17 at 12:44
  • 2
    Anyway, this script based on `python-fontconfig` seems much faster : http://unix.stackexchange.com/a/268286/26952 – Skippy le Grand Gourou Feb 07 '17 at 12:44
  • @SkippyleGrandGourou That sentence seems right? It passes `sys.argv[1]` to `TTFont()`? – Martin Tournoij Feb 07 '17 at 15:28
  • 1
    You can simplify : `chars = chain.from_iterable([y + (Unicode[y[0]],) for y in x.cmap.items()] for x in ttf["cmap"].tables)` by `chars = list(y + (Unicode[y[0]],) for x in ttf["cmap"].tables for y in x.cmap.items())` – Ismael EL ATIFI Jul 09 '19 at 13:56
41

The X program xfd can do this. To see all characters for the "DejaVu Sans Mono" font, run:

xfd -fa "DejaVu Sans Mono"

It's included in the x11-utils package on Debian/Ubuntu, xorg-x11-apps on Fedora/RHEL, and xorg-xfd on Arch Linux.

Cristian Ciupitu
Spencer
22

The fontconfig commands can output the supported code points as a compact list of hexadecimal ranges, e.g.:

$ fc-match --format='%{charset}\n' OpenSans
20-7e a0-17f 192 1a0-1a1 1af-1b0 1f0 1fa-1ff 218-21b 237 2bc 2c6-2c7 2c9
2d8-2dd 2f3 300-301 303 309 30f 323 384-38a 38c 38e-3a1 3a3-3ce 3d1-3d2 3d6
400-486 488-513 1e00-1e01 1e3e-1e3f 1e80-1e85 1ea0-1ef9 1f4d 2000-200b
2013-2015 2017-201e 2020-2022 2026 2030 2032-2033 2039-203a 203c 2044 2070
2074-2079 207f 20a3-20a4 20a7 20ab-20ac 2105 2113 2116 2120 2122 2126 212e
215b-215e 2202 2206 220f 2211-2212 221a 221e 222b 2248 2260 2264-2265 25ca
fb00-fb04 feff fffc-fffd

Use fc-query for a .ttf file and fc-match for an installed font name.

This likely doesn't involve installing any extra packages, and doesn't involve translating a bitmap.

Use fc-match --format='%{file}\n' to check whether the right font is being matched.
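
If you need individual code points and not ranges, the charset text is easy to expand. This is a minimal sketch, assuming the output has already been captured as a string; `expand_charset` is a hypothetical helper name, not something fontconfig provides:

```python
def expand_charset(charset):
    """Expand fontconfig charset text such as '20-7e a0-17f 192' into code points."""
    code_points = []
    for token in charset.split():
        # Each token is either a single hex value or a 'start-end' hex range.
        start, _, end = token.partition("-")
        first = int(start, 16)
        last = int(end, 16) if end else first
        code_points.extend(range(first, last + 1))
    return code_points


if __name__ == "__main__":
    sample = "20-22 192 fffc-fffd"
    print(" ".join(f"U+{cp:04X}" for cp in expand_charset(sample)))
```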

Neil Mayhew
  • This lies: it says “Gentium Italic” has, among others, “2150-2185”, but 2161 is definitely not in it. – mirabilos Mar 27 '19 at 22:41
  • 2
    @mirabilos I have Gentium 5.000 and it definitely does contain 2161: `ttx -t cmap -o - /usr/share/fonts/truetype/GentiumPlus-I.ttf | grep 0x2161` returns a matching entry. It's possible FontConfig is matching to a different font. Before I installed `gentium`, `fc-match 'Gentium Italic'` returned `FreeMono.ttf: "FreeMono" "Regular"`. If so, the output of `--format=%{charset}` would not show what you expect. – Neil Mayhew Mar 28 '19 at 16:59
  • I added a note mentioning the need to check whether the right font is being matched – Neil Mayhew Mar 28 '19 at 17:28
  • Gentium Plus ≠ Gentium (I have all three, normal, Basic and Plus installed, but I was wondering about Gentium) – ah nvm, I see the problem: $ fc-match --format='%{file}\n' Gentium /usr/share/fonts/truetype/gentium/Gentium-R.ttf $ fc-match --format='%{file}\n' Gentium\ Italic /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf $ fc-match --format='%{file}\n' Gentium:Italic /usr/share/fonts/truetype/gentium/Gentium-I.ttf And `fc-match --format='%{file} ⇒ %{charset}\n' Gentium:Italic` DTRT, wonderful. – mirabilos Mar 29 '19 at 00:50
  • 1
    Glad it worked out for you. Good tip about `Gentium:Italic` instead of `Gentium Italic`, too. Thanks for that. – Neil Mayhew Mar 30 '19 at 15:01
19

fc-query my-font.ttf will give you a map of supported glyphs and all the locales the font is appropriate for, according to fontconfig.

Since pretty much all modern Linux apps are fontconfig-based, this is much more useful than a raw Unicode list.

The actual output format is discussed here http://lists.freedesktop.org/archives/fontconfig/2013-September/004915.html

gavenkoa
nim
15

Here is a bash[1] script that prints each code point together with its character, using fc-match as mentioned in Neil Mayhew's answer (it can handle code points of up to 8 hex digits):

#!/bin/bash
for range in $(fc-match --format='%{charset}\n' "$1"); do
    for n in $(seq "0x${range%-*}" "0x${range#*-}"); do
        n_hex=$(printf "%04x" "$n")
        # using \U for 5-hex-digits
        printf "%-5s\U$n_hex\t" "$n_hex"
        count=$((count + 1))
        if [ $((count % 10)) = 0 ]; then
            printf "\n"
        fi
    done
done
printf "\n"

You can pass the font name or anything that fc-match accepts:

$ ls-chars "DejaVu Sans"

Updated content:

I learned that spawning subshells is very time-consuming (the printf command substitution in my script), so I wrote an improved version that is 5-10 times faster:

#!/bin/bash
for range in $(fc-match --format='%{charset}\n' "$1"); do
    for n in $(seq "0x${range%-*}" "0x${range#*-}"); do
        printf "%04x\n" "$n"
    done
done | while read -r n_hex; do
    count=$((count + 1))
    printf "%-5s\U$n_hex\t" "$n_hex"
    [ $((count % 10)) = 0 ] && printf "\n"
done
printf "\n"

Old version:

$ time ls-chars "DejaVu Sans" | wc
    592   11269   52740

real    0m2.876s
user    0m2.203s
sys     0m0.888s

New version (the line count indicates 5910+ characters, in 0.4 seconds!):

$ time ls-chars "DejaVu Sans" | wc
    592   11269   52740

real    0m0.399s
user    0m0.446s
sys     0m0.120s

End of update

Sample output (it aligns better in my st terminal):

0020    0021 !  0022 "  0023 #  0024 $  0025 %  0026 &  0027 '  0028 (  0029 )
002a *  002b +  002c ,  002d -  002e .  002f /  0030 0  0031 1  0032 2  0033 3
0034 4  0035 5  0036 6  0037 7  0038 8  0039 9  003a :  003b ;  003c <  003d =
003e >  003f ?  0040 @  0041 A  0042 B  0043 C  0044 D  0045 E  0046 F  0047 G
...
1f61a 1f61b 1f61c 1f61d 1f61e 1f61f 1f620 1f621 1f622 1f623
1f625 1f626 1f627 1f628 1f629 1f62a 1f62b 1f62d 1f62e 1f62f
1f630 1f631 1f632 1f633 1f634 1f635 1f636 1f637 1f638 1f639
1f63a 1f63b 1f63c 1f63d 1f63e 1f63f 1f640 1f643

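[1] The \U escape in printf is a bash builtin extension, not part of POSIX printf, so the script needs a bash shebang and will not work with #!/bin/sh.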

Lu Xu
  • 1
    #!/bin/sh => #!/bin/bash – vatosarmat Jan 23 '21 at 03:28
  • @vatosarmat, right, it should be something like bash, thanks. I guess the former works for me because the shell uses the executable `printf` instead of the shell built-in. – Lu Xu Jan 24 '21 at 05:47
  • Correction to last comment: #!/bin/sh shebang does not work for me either, maybe I really haven't tried it. My bad. – Lu Xu Jan 24 '21 at 06:04
  • \U may require 6 characters; \u for 4 characters. This is fairly typical for programming languages (otherwise it's ambiguous), although some things may be a bit lax. Makes a difference on Ubuntu 20.04 at least, where printf \U1f643 prints \u0001F643 (surrogate pair?), but \U01f643 returns the intended character – Cameron Kerr Mar 24 '21 at 12:01
  • @CameronKerr so adding '0's like "\U0$n_hex" in the printf line works for you on Ubuntu 20.04? – Lu Xu Mar 24 '21 at 12:20
  • Or, if \U requires at least 6 characters in your case, does that mean printf "\U0030" won't even work as desired? I really haven't tested the script on other systems than arch. – Lu Xu Mar 24 '21 at 12:25
  • 1
    Hmm, '\U0030' produces a '0', and '\U0030 ' produces '0 '. '\U0030a' produces '\u030a' (leading zeros, normalising to \u with 4 digits). However, as others have pointed out, this is bash builtin, not POSIX printf. /usr/bin/printf '\U0030' gives 'missing hexadecimal number in escape', and /usr/bin/printf '\u0030' gives 'invalid universal character name \u0030', but that's only because it should be specified as '0'. http://gnu-coreutils.7620.n7.nabble.com/usr-bin-printf-invalid-universal-character-name-td11992.html – Cameron Kerr Mar 24 '21 at 23:16
  • @CameronKerr, wow, thanks for all the research! That is more complicated than I anticipated, and the mailing list might be a bit beyond my knowledge :). From my understanding, the GNU's standalone printf program is more strict on what can follow '\u' and '\U' than the bash built-in? I can reproduce with /usr/bin/printf from coreutils package now, tho. – Lu Xu Mar 25 '21 at 03:37
  • I wanted to know how many characters were in a font, so here's a simplified oneliner of this answer that only counts characters: `for range in $(fc-match --format='%{charset}' "$1"); do seq "0x${range%-*}" "0x${range#*-}"; done | wc -l` – rebane2001 Apr 05 '22 at 08:56
13

The character code points of a TTF/OTF font are stored in its cmap table.

You can use ttx to generate an XML representation of the cmap table. See here.

Run ttx -t cmap MyFont.ttf (ttx.exe on Windows) and it should write a file MyFont.ttx. Open it in a text editor and it will show all the character codes found in the font.
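
The XML can also be read programmatically without opening it by hand. The sketch below uses only the Python standard library and assumes the usual ttx layout, in which each mapping appears as a `map` element with a hexadecimal `code` attribute; the embedded `SAMPLE_TTX` is a made-up, shortened stand-in for a real MyFont.ttx, and `code_points_from_ttx` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Made-up, shortened stand-in for the contents of a real .ttx file.
SAMPLE_TTX = """
<ttFont>
  <cmap>
    <tableVersion version="0"/>
    <cmap_format_4 platformID="0" platEncID="3" language="0">
      <map code="0x20" name="space"/>
      <map code="0x41" name="A"/>
    </cmap_format_4>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0x41" name="A"/>
      <map code="0x5c81" name="uni5C81"/>
    </cmap_format_4>
  </cmap>
</ttFont>
"""


def code_points_from_ttx(xml_text):
    """Return the sorted, de-duplicated code points found in ttx cmap XML."""
    root = ET.fromstring(xml_text)
    return sorted(
        int(entry.get("code"), 16)
        for entry in {e.get("code"): e for e in root.iter("map") if e.get("code")}.values()
    )


if __name__ == "__main__":
    for cp in code_points_from_ttx(SAMPLE_TTX):
        print(f"U+{cp:04X}")
```

For a real file, read its text with `open("MyFont.ttx", encoding="utf-8").read()` and pass that to the function. Entries without a `code` attribute are skipped.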

jdhao
wschang
  • Note that `ttx` is part of the `fonttools` mentioned in the accepted answer. It's a Python script, so it's also available on Mac and Linux. – mivk May 09 '19 at 12:37
  • 1
    You can make `ttx` show the output in STDOUT by using `-o -`. For example, `ttx -o - -t cmap myfont.ttf` will dump the content of the `cmap` table in the font `myfont.ttf` to STDOUT. You can then use it to see if a given character is defined in a given font (e.g. `$ ttx -o - -t cmap myfont.ttf | grep '5c81'`) – rdrg109 Jan 29 '22 at 17:34
5

I just had the same problem, and made a HOWTO that goes one step further, baking a regexp of all the supported Unicode code points.

If you just want the array of codepoints, you can use this when peeking at your ttx xml in Chrome devtools, after running ttx -t cmap myfont.ttf and, probably, renaming myfont.ttx to myfont.xml to invoke Chrome's xml mode:

function codepoint(node) { return Number(node.nodeValue); }
$x('//cmap/*[@platformID="0"]/*/@code').map(codepoint);

(Also relies on fonttools from gilamesh's suggestion; sudo apt-get install fonttools if you're on an Ubuntu system.)

ecmanaut
3

To add to @Oliver Lew's answer, I've added the option to query a local font file instead of an installed system font:

#!/bin/bash

# If the first argument is a font file, use fc-query instead of fc-match to
# read the charset directly from that file
[[ -f "$1" ]] && fc='fc-query' || fc='fc-match'

for range in $($fc --format='%{charset}\n' "$1"); do
    for n in $(seq "0x${range%-*}" "0x${range#*-}"); do
        printf "%04x\n" "$n"
    done
done | while read -r n_hex; do
    count=$((count + 1))
    printf "%-5s\U$n_hex\t" "$n_hex"
    [ $((count % 10)) = 0 ] && printf "\n"
done
printf "\n"
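
The same file-versus-name decision can be sketched in Python. This is only an illustration: `charset_command` and `read_charset` are hypothetical helper names, and `read_charset` assumes the fontconfig tools are installed and on the PATH:

```python
import os
import subprocess


def charset_command(target):
    """Build the fontconfig command for a font file path or an installed font name."""
    # fc-query reads a font file directly; fc-match resolves an installed font name.
    tool = "fc-query" if os.path.isfile(target) else "fc-match"
    return [tool, r"--format=%{charset}\n", target]


def read_charset(target):
    """Run the command and return the charset tokens, e.g. ['20-7e', 'a0-17f']."""
    result = subprocess.run(
        charset_command(target), capture_output=True, text=True, check=True
    )
    return result.stdout.split()


if __name__ == "__main__":
    print(charset_command("DejaVu Sans"))
```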
sergiuser
1

Janus's answer above (https://stackoverflow.com/a/19438403/431528) works, but Python is too slow, especially for Asian fonts: it takes minutes for a 40 MB font file on my E5 computer.

So I wrote a small C++ program to do the same thing. It depends on FreeType 2 (https://www.freetype.org/). It is a VS2015 project, but since it is a console application it should be easy to port to Linux.

The code can be found at https://github.com/zhk/AllCodePoints. For the 40 MB Asian font, it takes about 30 ms on my E5 computer.

zhk_tiger
0

You can do this on Linux in Perl using the Font::TTF module.

hippietrail
  • 2
    Yes, it should be possible. But it's a complex suite of modules, with miserable documentation. So without an example of how it could be done, this answer seems quite useless. – mivk Jul 17 '18 at 19:38
  • @mivk I would go as far as saying, this isn't an answer. – Hacker Mar 08 '23 at 23:43
0

If you ONLY want to "view" the fonts, the following might be helpful (if your terminal supports the font in question):

#!/usr/bin/env python
import sys
from fontTools.ttLib import TTFont

with TTFont(sys.argv[1], 0, ignoreDecompileErrors=True) as ttf:
    for x in ttf["cmap"].tables:
        for (_, code) in x.cmap.items():
            point = code.replace('uni', '\\u').lower()
            print("echo -e '" + point + "'")

An unsafe, but easy way to view:

python font.py my-font.ttf | sh

Thanks to Janus (https://stackoverflow.com/a/19438403/431528) for the answer above.

phuclv
hiddensunset4
0

If you want to get all characters supported by a font, you may use the following (based on Janus's answer):

from fontTools.ttLib import TTFont

def get_font_characters(font_path):
    with TTFont(font_path) as font:
        characters = {chr(y[0]) for x in font["cmap"].tables for y in x.cmap.items()}
    return characters
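
To get from that set of characters to the U+0123-style list the question asks for, something like the following could work. It is a sketch using only the standard library; `format_ranges` is a hypothetical helper, and collapsing consecutive code points into ranges is a presentation choice, not something fontTools does for you:

```python
def format_ranges(code_points):
    """Collapse code points into 'U+XXXX' or 'U+XXXX-U+YYYY' strings."""
    ranges = []
    start = prev = None
    for cp in sorted(set(code_points)):
        if prev is not None and cp == prev + 1:
            prev = cp  # still inside the current run of consecutive values
            continue
        if start is not None:
            ranges.append((start, prev))  # close the previous run
        start = prev = cp
    if start is not None:
        ranges.append((start, prev))  # close the final run
    return [
        f"U+{a:04X}" if a == b else f"U+{a:04X}-U+{b:04X}"
        for a, b in ranges
    ]


if __name__ == "__main__":
    print(format_ranges([0x41, 0x42, 0x43, 0x123, 0x1234, 0x1235]))
```

With the function above it could be called as `format_ranges(ord(c) for c in get_font_characters(font_path))`.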
Bruno Degomme
0

The FreeType project provides demo applications, one of which is called "ftdump". Run "ftdump -V path-to-the-font-file" and you will get what you are looking for. To view the source code, you can clone the sources from here: https://www.freetype.org/developer.html

On Ubuntu it can be installed with "sudo apt install freetype2-demos"

Note: Try "-c" instead of "-V". I see that args have changed between versions.

gatis paeglis
0

There's a website that does it without command line / Perl / Python, etc.

https://fontdrop.info/

MarcinWolny