How can I make my program fully Unicode (or close to it) compatible?

Question

I wrote a Reddit bot that looks for posts that don't show up well on mobile and converts them to images. One issue I'm having is that it doesn't handle Unicode symbols well, as these two examples show: http://www.reddit.com/r/pics/comments/2lgf08/mom_i_thought_you_were_taking_me_to_see_harry/clumkde and http://www.reddit.com/r/mobilewizard/comments/2j62ix/html_entity_test_part_ii/cl8plni

As you can see, I can make basic HTML entities work (because I use HTMLParser to convert those entities and encode the result as UTF-8), but fancier symbols don't work. Is this a limitation of the Python Imaging Library, or is there something I can do? I thought converting to UTF-8 would be sufficient. If it matters, the font I'm using is Courier New.

All the code is here:

from PIL import Image, ImageDraw, ImageFont
from cStringIO import StringIO

import HTMLParser

def str_to_img(str):
    """Converts a given string to a PNG image, and saves it to the return variable"""
    # use 12pt Courier New for ASCII art
    font = ImageFont.truetype("cour.ttf", 12)

    # do some string preprocessing
    str = str.replace("\n\n", "\n")  # Reddit requires double newline for new line, don't let the bot do this
    h = HTMLParser.HTMLParser()
    str = h.unescape(str).encode('utf-8')  # convert HTML entities to plain text

    # create a placeholder image to determine correct image
    img = Image.new('RGB', (1, 1))
    d = ImageDraw.Draw(img)

    str_by_line = str.split("\n")
    num_of_lines = len(str_by_line)

    line_widths = []
    for i, line in enumerate(str_by_line):
        line_widths.append(d.textsize(str_by_line[i], font=font)[0])
    line_height = d.textsize(str, font=font)[1]  # the height of a line of text should be unchanging

    img_width = max(line_widths)  # the image width is the largest of the individual line widths
    img_height = num_of_lines * line_height  # the image height is the # of lines * line height

    # creating the output image
    # add 5 pixels to account for lowercase letters that might otherwise get truncated
    img = Image.new('RGB', (img_width, img_height + 5), 'white')
    d = ImageDraw.Draw(img)

    for i, line in enumerate(str_by_line):
        d.text((0, i * line_height), line, font=font, fill='black')
    output = StringIO()
    img.save(output, format='PNG')

    return output

Have you looked at this http://stackoverflow.com/questions/18729148/unicode-characters-not-rendering-with-pil-imagefont#comment27601679_18729512? He seemed to have the same problem you're having. — Daniel Le, Jan 28 '15 at 05:11
The problem is that he is converting the symbols into Unicode `\u` escapes, which I can't do because I'm taking them in directly. Also, I have no idea how to add a `u` prefix to a string when it is a variable. For example, "¥" as an input will fail. — itsmichaelwang, Jan 28 '15 at 06:53
I tried changing `str = h.unescape(str).encode('utf-8')` to `str = h.unescape(str).decode('utf-8')` and str_to_img("¥") outputs the correct image. — Daniel Le, Jan 28 '15 at 11:11
@Zapurdead: your question title is too broad. Unicode is a *huge* standard. Your question is much, much simpler: how to convert bytes to Unicode. The answer is to use the `.decode()` method: `unicode_text = bytestring.decode(character_encoding)`. You should understand the difference between bytes and Unicode (writing your bot in Python 3 would force you to solve Unicode issues earlier). — jfs, Jan 28 '15 at 13:32
Here's a [Python script that draws all named Unicode codepoints](https://gist.github.com/zed/07a8175ba07f393a6004#file-text2image-py) — jfs, Jan 28 '15 at 15:37
So I upgraded the bot to Python 3.4 (it was running 2.7 before) and a large portion of characters are now showing up correctly. I'll try some of the solutions here and let you know if they work for the rest of the characters. — itsmichaelwang, Jan 28 '15 at 17:45
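
The `.decode()` advice from the comments can be sketched for Python 3 roughly as follows. This is a minimal sketch, not the bot's actual code: `to_text` and its `encoding` parameter are hypothetical names, and it assumes the input arrives either as UTF-8 bytes or as an already-decoded string. It uses only the standard library's `html.unescape` (available from Python 3.4) in place of `HTMLParser.unescape`.

```python
import html


def to_text(raw, encoding="utf-8"):
    """Return `raw` as a Unicode string with HTML entities expanded.

    `to_text` is a hypothetical helper for illustration only.
    """
    # Bytes have to be *decoded* into text; `.encode()` goes the other
    # way (text -> bytes) and is what the original code called.
    if isinstance(raw, bytes):
        raw = raw.decode(encoding)
    # Expand entities such as "&yen;" or "&#9731;" into real characters.
    return html.unescape(raw)


# Usage: both inputs end up as ordinary Unicode strings.
yen_from_bytes = to_text(b"\xc2\xa5")   # UTF-8 bytes for the yen sign
yen_from_entity = to_text("&yen;")      # named HTML entity
```

The resulting string would then be passed to the drawing code as text, without a further `.encode('utf-8')` call. Whether a given symbol actually renders may still depend on whether the chosen font contains a glyph for it, which this sketch does not check.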

