Generate character images with a font whose name cannot be correctly decoded

Question

I am creating images of Chinese seal script. I have three TrueType fonts for this task (Jin_Wen_Da_Zhuan_Ti.7z, Zhong_Guo_Long_Jin_Shi_Zhuan.7z, Zhong_Yan_Yuan_Jin_Wen.7z, for testing purposes only). Below are their appearances in Microsoft Word

appearance in Word

of the Chinese character "我" (I/me). Here is my Python script:

import numpy as np
from PIL import Image, ImageFont, ImageDraw, ImageChops
import itertools
import os


def grey2binary(grey, white_value=1):
    grey[np.where(grey <= 127)] = 0
    grey[np.where(grey > 127)] = white_value
    return grey


def create_testing_images(characters,
                          font_path,
                          save_to_folder,
                          sub_folder=None,
                          image_size=64):
    font_size = image_size * 2
    if sub_folder is None:
        sub_folder = os.path.split(font_path)[-1]
        sub_folder = os.path.splitext(sub_folder)[0]
    sub_folder_full = os.path.join(save_to_folder, sub_folder)
    if not os.path.exists(sub_folder_full):
        os.mkdir(sub_folder_full)
    font = ImageFont.truetype(font_path,font_size)
    bg = Image.new('L',(font_size,font_size),'white')

    for char in characters:
        img = Image.new('L',(font_size,font_size),'white')
        draw = ImageDraw.Draw(img)
        draw.text((0,0), text=char, font=font)
        diff = ImageChops.difference(img, bg)
        bbox = diff.getbbox()
        if bbox:
            img = img.crop(bbox)
            img = img.resize((image_size, image_size), resample=Image.BILINEAR)

            img_array = np.array(img)
            img_array = grey2binary(img_array, white_value=255)

            edge_top = img_array[0, range(image_size)]
            edge_left = img_array[range(image_size), 0]
            edge_bottom = img_array[image_size - 1, range(image_size)]
            edge_right = img_array[range(image_size), image_size - 1]

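            # Sum the border pixels as Python ints so uint8 values cannot overflow
            criterion = sum(int(v) for v in itertools.chain(edge_top, edge_left,
                                                            edge_bottom, edge_right))

            # Keep the image only if more than half of the border is white,
            # which rejects all-black or black-framed output
            if criterion > 255 * image_size * 2: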
                img = Image.fromarray(np.uint8(img_array))
                img.save(os.path.join(sub_folder_full, char) + '.gif')

where the core snippet is

        font = ImageFont.truetype(font_path,font_size)
        img = Image.new('L',(font_size,font_size),'white')
        draw = ImageDraw.Draw(img)
        draw.text((0,0), text=char, font=font)

For example, if you put those fonts in the folder ./fonts, and call it with

create_testing_images(['我'], 'fonts/金文大篆体.ttf', save_to_folder='test')

the script will create ./test/金文大篆体/我.gif in your file system.

Now the problem is, though it works well with the first font 金文大篆体.ttf (in Jin_Wen_Da_Zhuan_Ti.7z), the script does not work on the other two fonts, even if they can be rendered correctly in Microsoft Word: for 中國龍金石篆.ttf (in Zhong_Guo_Long_Jin_Shi_Zhuan.7z), it draws nothing so bbox will be None; for 中研院金文.ttf (in Zhong_Yan_Yuan_Jin_Wen.7z), it will draw a black frame with no character in the picture.

enter image description here

and thus fails the criterion test, whose purpose is to reject an all-black output. I used FontForge to check the properties of the fonts, and found that the first font 金文大篆体.ttf (in Jin_Wen_Da_Zhuan_Ti.7z) uses UnicodeBmp

UnicodeBmp

while the other two use Big5hkscs

Big5hkscs_中國龍金石篆中研院金文

which is not the encoding scheme of my system. This may be the reason that the font names are unrecognizable in my system:

font viewer

I also tried to work around this by selecting the font through its garbled name. I tried pycairo after installing those fonts:

import cairo

# adapted from
# http://heuristically.wordpress.com/2011/01/31/pycairo-hello-world/

# setup a place to draw
surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, 100, 100)
ctx = cairo.Context (surface)

# paint background
ctx.set_source_rgb(1, 1, 1)
ctx.rectangle(0, 0, 100, 100)
ctx.fill()

# draw text
ctx.select_font_face('金文大篆体')
ctx.set_font_size(80)
ctx.move_to(12,80)
ctx.set_source_rgb(0, 0, 0)
ctx.show_text('我')

# finish up
ctx.stroke() # commit to surface
surface.write_to_png('我.png')  # write_to_png always emits PNG data

This works well again with 金文大篆体.ttf (in Jin_Wen_Da_Zhuan_Ti.7z):

enter image description here

but still not with others. For example: neither ctx.select_font_face('中國龍金石篆') (which reports _cairo_win32_scaled_font_ucs4_to_index:GetGlyphIndicesW) nor ctx.select_font_face('¤¤°êÀsª÷¥Û½f') (which draws with the default font) works. (The latter name is the messy code displayed in the font viewer as shown above, obtained by a line of Mathematica code ToCharacterCode["中國龍金石篆", "CP950"] // FromCharacterCode where CP950 is the code page of Big5.)
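The Mathematica one-liner above can be reproduced in Python alone. This is a minimal Python 3 sketch; `garble` and `ungarble` are hypothetical helper names, and it assumes that the font viewer displays each byte of the CP950-encoded name as one Latin-1 character, which is what the garbled name in the question suggests.

```python
# Hypothetical helpers: only str.encode() and bytes.decode() are real APIs here.

def garble(name, font_codec="cp950"):
    """Encode with the font's codec, then misread every byte as a Latin-1 character."""
    return name.encode(font_codec).decode("latin-1")


def ungarble(garbled, font_codec="cp950"):
    """Undo garble(): recover the bytes, then decode them with the font's codec."""
    return garbled.encode("latin-1").decode(font_codec)


garbled = garble("中國龍金石篆")
print(garbled)            # should match the garbled name shown in the font viewer
print(ungarble(garbled))  # round-trips back to the readable name
```

The reverse direction is probably the more useful one: given a garbled name read from the system, `ungarble` recovers a readable one without changing the system code page.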

So I think I've tried my best to tackle this issue, but still cannot solve it. I've also come up with other ways like renaming the font name with FontForge or changing the system encoding to Big5, but I would still prefer a solution that involves Python only and thus needs less additional actions from the user. Any hints will be greatly appreciated. Thank you.

To the moderators of stackoverflow: this problem may seem "too localized" at its first glance, but it could happen in other languages / other encodings / other fonts, and the solution can be generalized to other cases, so please don't close it with this reason. Thank you.

UPDATE: Weirdly Mathematica can recognize the font name in CP936 (GBK, which can be thought of as my system encoding). Take 中國龍金石篆.ttf (in Zhong_Guo_Long_Jin_Shi_Zhuan.7z) for an example:

Mathematica

But setting ctx.select_font_face('ÖÐøý½ðÊ¯*') does not work either, which will create the character image with the default font.

Can't help you with that, but take my +1 for a well written and thoroughly researched question. — georg, Jun 02 '13 at 21:04
You might want to consider specifying the `encoding` parameter like `ImageFont.truetype(font_path,font_size,encoding="big5")`. — Silvia, Jun 03 '13 at 07:01
@Silvia thanks for telling me the parameter, but unfortunately it doesn't work. — Ziyuan, Jun 03 '13 at 17:05

Aya · Accepted Answer · 2013-06-11T17:25:54.467

Silvia's comment on the OP...

You might want to consider specifying the encoding parameter like ImageFont.truetype(font_path,font_size,encoding="big5")

...gets you halfway there, but it looks like you also have to manually translate the Unicode characters if you're not using a Unicode font.

For the fonts which use "big5hkscs" encoding, I had to do this...

>>> u = u'\u6211'      # Unicode for 我
>>> u.encode('big5hkscs')
'\xa7\xda'

...then use u'\ua7da' to get the right glyph, which is a bit weird, but it looks to be the only way to pass a multi-byte character to PIL.

The following code works for me on both Python 2.7.4 and Python 3.3.1, with PIL 1.1.7...

from PIL import Image, ImageDraw, ImageFont


# Declare font files and encodings
FONT1 = ('Jin_Wen_Da_Zhuan_Ti.ttf',          'unicode')
FONT2 = ('Zhong_Guo_Long_Jin_Shi_Zhuan.ttf', 'big5hkscs')
FONT3 = ('Zhong_Yan_Yuan_Jin_Wen.ttf',       'big5hkscs')


# Declare a mapping from encodings used by str.encode() to encodings used by
# the FreeType library
ENCODING_MAP = {'unicode':   'unic',
                'big5':      'big5',
                'big5hkscs': 'big5',
                'shift-jis': 'sjis'}


# The glyphs we want to draw
GLYPHS = ((FONT1, u'\u6211'),
          (FONT2, u'\u6211'),
          (FONT3, u'\u6211'),
          (FONT3, u'\u66ce'),
          (FONT2, u'\u4e36'))


# Returns PIL Image object
def draw_glyph(font_file, font_encoding, unicode_char, glyph_size=128):

    # Translate unicode string if necessary
    if font_encoding != 'unicode':
        mb_string = unicode_char.encode(font_encoding)
        try:
            # Try using Python 2.x's unichr
            unicode_char = unichr(ord(mb_string[0]) << 8 | ord(mb_string[1]))
        except NameError:
            # Use Python 3.x-compatible code
            unicode_char = chr(mb_string[0] << 8 | mb_string[1])

    # Load font using mapped encoding
    font = ImageFont.truetype(font_file, glyph_size, encoding=ENCODING_MAP[font_encoding])

    # Now draw the glyph
    img = Image.new('L', (glyph_size, glyph_size), 'white')
    draw = ImageDraw.Draw(img)
    draw.text((0, 0), text=unicode_char, font=font)
    return img


# Save an image for each glyph we want to draw
for (font_file, font_encoding), unicode_char in GLYPHS:
    img = draw_glyph(font_file, font_encoding, unicode_char)
    filename = '%s-%s.png' % (font_file, hex(ord(unicode_char)))
    img.save(filename)

Note that I renamed the font files to the same names as the 7zip files. I try to avoid using non-ASCII characters in code examples, since they sometimes get screwed up when copy/pasting.

This example should work fine for the types declared in ENCODING_MAP, which can be extended if needed (see the FreeType encoding strings for valid FreeType encodings), but you'll need to change some of the code in cases where the Python str.encode() doesn't produce a multi-byte string of length 2.
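For the case where `str.encode()` does not return exactly two bytes, one possible generalization is to pack however many bytes come back into a single integer. This is a minimal Python 3 sketch; `pack_charcode` is a hypothetical helper, and whether a given font's charmap actually contains single-byte codes has not been tested here.

```python
def pack_charcode(char, font_encoding="big5hkscs"):
    """Return a one-character str whose code point equals the font's native
    character code for `char` (hypothetical helper, Python 3 only)."""
    if len(char) != 1:
        raise ValueError("expected exactly one character, got %r" % (char,))
    raw = char.encode(font_encoding)
    if len(raw) > 2:
        # Some codecs emit longer sequences; those cannot be packed into one
        # 16-bit character code, so fail loudly instead of guessing.
        raise ValueError("%r encodes to %d bytes" % (char, len(raw)))
    # Big-endian packing: b'\xa7\xda' -> 0xa7da, b'A' -> 0x41
    return chr(int.from_bytes(raw, "big"))


print(hex(ord(pack_charcode("\u6211"))))  # two-byte Big5 code
print(hex(ord(pack_charcode("A"))))       # single-byte ASCII code
```

The result can be passed as the `text` argument of `draw.text()` in place of the `try...except` block in `draw_glyph()`.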

Update

If the problem is in the ttf file, how could you find the answer in the PIL and FreeType source code? Above, you seem to be saying PIL is to blame, but why should one have to pass unicode_char.encode(...).decode(...) when you just want unicode_char?

As I understand it, the TrueType font format was developed before Unicode became widely adopted, so if you wanted to create a Chinese font back then, you had to use one of the encodings in use at the time, and Big5 had been the dominant encoding for Traditional Chinese (in Taiwan and Hong Kong, for example) since the mid 1980s.

It stands to reason, then, that there had to be a way to retrieve glyphs from a Big5-encoded TTF using the Big5 character encodings.

The C code for rendering a string with PIL starts with the font_render() function, and ultimately calls FT_Get_Char_Index() to locate the correct glyph, given the character code as an unsigned long.

However, PIL's font_getchar() function, which produces that unsigned long, only accepts Python string and unicode types. Since it doesn't seem to do any translation of the character encodings itself, it seemed that the only way to get the correct value for the Big5 character set was to coerce a Python unicode character into the correct unsigned long value, by exploiting the fact that u'\ua7da' is stored internally as the integer 0xa7da, either in 16 bits or 32 bits, depending on how Python was compiled.

TBH, there was a fair amount of guesswork involved, since I didn't bother to investigate what exactly the effect of ImageFont.truetype()'s encoding parameter is, but by the looks of it, it's not supposed to do any translation of character encodings, but rather to allow a single TTF file to support multiple character encodings of the same glyphs, using the FT_Select_Charmap() function to switch between them.

So, as I understand it, the FreeType library's interaction with the TTF files works something like this...

#!/usr/bin/env python
# -*- coding: utf-8 -*-

class TTF(object):
    glyphs = {}
    encoding_maps = {}

    def __init__(self, encoding='unic'):
        self.set_encoding(encoding)

    def set_encoding(self, encoding):
        self.current_encoding = encoding

    def get_glyph(self, charcode):
        try:
            return self.glyphs[self.encoding_maps[self.current_encoding][charcode]]
        except KeyError:
            return ' '


class MyTTF(TTF):
    glyphs = {1: '我',
              2: '曎'}
    encoding_maps = {'unic': {0x6211: 1, 0x66ce: 2},
                     'big5': {0xa7da: 1, 0x93be: 2}}


font = MyTTF()
# print() with a single argument works on both Python 2 and Python 3
print('Get via Unicode map: %s' % font.get_glyph(0x6211))
font.set_encoding('big5')
print('Get via Big5 map: %s' % font.get_glyph(0xa7da))

...but it's up to each TTF to provide the encoding_maps variable, and there's no requirement for a TTF to provide one for Unicode. Indeed, it's unlikely that a font created prior to the adoption of Unicode would have.

Assuming all that is correct, then there's nothing wrong with the TTF - the problem is just with PIL making it a little awkward to access glyphs for fonts which don't have a Unicode mapping, and for which the required glyph's unsigned long character code is greater than 255.

FONT2 and FONT3 are actually `big5hkscs` fonts, which is an extension of `big5`. Though FreeType does not recognize `big5hkscs`, it seems to accept the extended codes. In relation to the OP's comment on my post, I successfully rendered 曎 (\u66ce, \x93be in big5hkscs, undefined in big5) using the Zhong_Yan_Yuan_Jin_Wen font by passing `big5` to `truetype()`, and `big5hkscs` to the `encode()` method. — Kenji Noguchi, Jun 07 '13 at 02:49
@KenjiNoguchi Ah. I didn't test the other examples. Have updated the code to cope with those. — Aya, Jun 07 '13 at 10:34
Thanks! This also works for me. For one thing, I think using `'我'` instead of `u'\u6211'` is OK (at least under Python 3) because both results of `encode('big5hkscs')` are the same. For another, the "exclusive or" of the glyphs from this method and from @Kenji Noguchi's with the font `Zhong_Yan_Yuan_Jin_Wen.ttf` is not empty (for example, for character set of GB2312). The reason might be that the font could not correctly handle the characters that are not included in the font. I need some investigation before giving the bounty. — Ziyuan, Jun 09 '13 at 09:43
@ziyuang You can use `'我'` instead of `u'\u6211'` (or just `'\u6211'` in Python 3.x - the `u` prefix is redundant) as long as you tell Python what encoding your source code uses, and you're certain your text editor will save the source code in the same encoding that you specify. I only used `u'\u6211'` in the example because: a) I don't know what encoding your text editor uses, and b) I don't know what Python version you're using, so I wanted to be sure that the example worked in any encoding with any version of Python. — Aya, Jun 09 '13 at 15:19
It appears the `try...except` block can be replaced (in both Python2 and Python3) with `unicode_char = unicode_char.encode(font_encoding).decode('utf_16_be')`. — unutbu, Jun 11 '13 at 02:53
@unutbu Almost, but it would fail for `'觤'` (`u'\u89e4'`), which encodes in `big5hkscs` to `'\xdf\xfe'`, which is an invalid UTF16-encoded string. The code `u'\u89e4'.encode('big5hkscs').decode('utf_16_be')` raises `UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1: unexpected end of data`. What's really needed is a UCS-2 decoder, but it doesn't look like there is one. — Aya, Jun 11 '13 at 13:18
Thanks for the info, Aya. How did you find this counterexample, `u'\u89e4'`? — unutbu, Jun 11 '13 at 13:38
@unutbu Well, the [WP page for UTF-16](http://en.wikipedia.org/wiki/UTF-16) says that code points U+D800 to U+DFFF are reserved for UTF-16 encoding, and a [table of big5 character encodings](http://ash.jp/code/cn/big5tbl.htm) shows several glyphs within this range - I just copy/pasted one into Python to get its Unicode value. It's not normally an issue, because those code points are reserved anyway, so, in most cases, UTF-16 and UCS-2 are interchangeable. It's only an issue here because the interface between Python and PIL only accepts Python `string` or `unicode` objects. — Aya, Jun 11 '13 at 13:49
Okay, fair enough. Then how did you know UCS-2 is the proper decoder? — unutbu, Jun 11 '13 at 14:04
@unutbu By reading some of the source code for PIL and the FreeType library. — Aya, Jun 11 '13 at 14:31
Aya, please correct me as I'm clearly misunderstanding something. PIL seems to work as expected when passed unicode. The error is in the way the ttf files map unicode to glyphs -- it is associating the wrong unicode to the glyphs, and this is why we need something like `.encode(font_encoding).decode('ucs-2')` to work around the mismapping. If the problem is in the ttf file, how could you find the answer in the PIL and FreeType source code? Above, you seem to be saying PIL is to blame, but why should one have to pass `unicode_char.encode(...).decode(...)` when you just want `unicode_char`? — unutbu, Jun 11 '13 at 14:52
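The surrogate-range problem discussed in the comments above can be checked in isolation. This is a small Python 3 sketch that uses the bytes `'\xdf\xfe'` quoted in the comments. The `surrogatepass` error handler is a standard Python codec option (supported by the UTF-16 codecs since Python 3.4) and behaves much like the missing UCS-2 decoder. Whether PIL then accepts the resulting lone surrogate unchanged is an assumption that was not tested here.

```python
raw = b"\xdf\xfe"  # big5hkscs bytes quoted above for u'\u89e4'

# Strict UTF-16 decoding rejects the pair, because 0xDFFE is a lone surrogate.
try:
    raw.decode("utf_16_be")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'surrogatepass' lets the surrogate through as a plain 16-bit code unit.
lenient = raw.decode("utf_16_be", "surrogatepass")

# Explicit bit packing, as in draw_glyph(), gives the same character.
packed = chr(raw[0] << 8 | raw[1])

print(strict_ok)
print(hex(ord(lenient)), hex(ord(packed)))
```

Note that the lone surrogate itself cannot be printed to a UTF-8 terminal, which is why the sketch prints only its code point.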

score 4 · Answer 2 · answered Jun 05 '13 at 09:07

The problem is the fonts not strictly conforming to the TrueType specification. A quick solution is to use FontForge (you are using it already), and let it sanitize the fonts.

1. Open a font file.
2. Go to Encoding, then select Reencode.
3. Choose ISO 10646-1 (Unicode BMP).
4. Go to File, then Generate Fonts.
5. Save as TTF.
6. Run your script with the newly generated fonts.

Voila! It prints 我 in a beautiful font!
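The same steps can probably be scripted, which may suit the request for a Python-only workflow. This is a rough, untested sketch that assumes FontForge was installed with its Python bindings (the `fontforge` module). `reencode_to_unicode` and both paths are hypothetical, and the encoding name `"UnicodeBmp"` is taken from the FontForge screenshot described in the question.

```python
def reencode_to_unicode(src_path, dst_path):
    """Re-encode a legacy-encoded TTF to Unicode BMP and save a copy.

    Hypothetical helper; needs FontForge's Python module at call time.
    """
    import fontforge  # imported lazily so this file loads without FontForge

    font = fontforge.open(src_path)
    font.encoding = "UnicodeBmp"   # same as Encoding > Reencode in the GUI
    font.generate(dst_path)        # same as File > Generate Fonts
    font.close()


# Example call (paths are placeholders):
# reencode_to_unicode("Zhong_Yan_Yuan_Jin_Wen.ttf", "Zhong_Yan_Yuan_Jin_Wen_unicode.ttf")
```

The original font file is left untouched, so the output can be compared glyph by glyph against the original before relying on it.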

answered Jun 05 '13 at 09:07 by Kenji Noguchi

Unfortunately this re-encoding seems to introduce some errors to the fonts. Would you please try the character '曎' with the font 中研院金文.ttf in Zhong_Yan_Yuan_Jin_Wen.7z? Originally this character is not considered in the font. But after re-encoding, it is rendered as something else. And this is not the only case when I was testing characters from GBK – Ziyuan Jun 06 '13 at 11:29
You can also try "丶" with 中國龍金石篆.ttf in Zhong_Guo_Long_Jin_Shi_Zhuan.7z. I believe the re-encoding has some problems when dealing with the characters that are not originally considered in the font. – Ziyuan Jun 06 '13 at 11:48
曎 is located at 0x93be in 中研院金文.ttf(big5hkscs). It's located at 0x66ce in the converted font (unicode). It's correct. – Kenji Noguchi Jun 07 '13 at 04:00
Same for 丶. It's located at 0xc6bf in 中國龍金石篆.ttf(big5hkscs). It's located at 0x4e36 in the converted Unicode font. Why do you think they should not be in the font? – Kenji Noguchi Jun 07 '13 at 04:18
Here "correct" means Python2.7's Unicode-big5hkscs mapping and FontForge reencoding match. – Kenji Noguchi Jun 07 '13 at 05:06
My judgment of "incorrect" comes from visual checking: the image of "曎" looks very different from "曎" with the re-encoded 中研院金文.ttf. For me, it's very similar to the image of "鐘". Actually, in Microsoft Word, "曎" and "丶" have no correspondences in the original 中研院金文.ttf and 中國龍金石篆.ttf, respectively. Here "no correspondence" means the appearance of the character does not change after specifying the font of interest. And according to the beginning of the question, I assume Word can cope with fonts well. – Ziyuan Jun 07 '13 at 20:57
It does look different from 曎 but I don't know Chinese so I can't tell. It could be font bug. – Kenji Noguchi Jun 07 '13 at 23:03
Your solution and @Aya's solution work equally well for me. I appreciate the simplicity of your solution. But I somehow prefer his/hers, because it does not require modifications to the font, and it also exposes more of the internal mechanism for me (like the FreeType encodings). So I would like to give the bounty to him/her. Thanks again for the neat answer! – Ziyuan Jun 09 '13 at 16:51
I like @Aya's answer, too, for the same reasons. In the future all legacy encodings will be deprecated by Unicode IMHO. @Aya's solution already exposed some shortcomings of PIL and FreeType for big5/big5hkscs support. Passing a `big5`-encoded string as a Unicode object like `u'\ua7da'` is a weird hack, as she said! – Kenji Noguchi Jun 09 '13 at 23:53
@KenjiNoguchi I think that Unicode already has superseded legacy encodings in most cases. The problem here is backward compatibility for fonts created before that happened. Reading through the metadata for the three font files, they were originally created in 2007, 1994, and 1997 respectively, from which you could infer that the majority of systems switched from Big5 to Unicode somewhere between 1997 and 2007, and given Microsoft's dominance in the OS market, [this WP page](http://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows) would tend to agree. See also the recent update to my answer. – Aya Jun 11 '13 at 20:37
