Removing non-Latin characters from a string in Python 3

Question

I'm passing a string to PIL's multiline_text() which, for some reason, doesn't support UTF-8, only Latin characters.

from PIL import Image, ImageDraw

input_string = "‘Hi’"

img_width = 500
img_height = 500
img = Image.new('RGB', (img_width, img_height), (255, 255, 255))
img_D = ImageDraw.Draw(img)
img_D.multiline_text((0, 0), input_string)    # <- bug here
img.save("test_img.jpeg", 'jpeg', optimize=True, quality=95)    # Pillow's JPEG quality scale tops out at 95

I get this error message: UnicodeEncodeError: 'latin-1' codec can't encode character '\u2018' in position 0: ordinal not in range(256)

So I need to get rid of all non-Latin characters. How can I do that?

Note: I've seen this answer, namely input_string = regex.sub(ur'[^\p{Latin}]', u'', t1), but I'm pretty sure it's for Python 2: I get SyntaxError: invalid syntax. If I remove the u, the r, or both, I get error: bad escape \p.
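
For reference, the ur'' string prefix no longer exists in Python 3, and "bad escape \p" is the error the standard-library re module raises, because \p{...} classes are a feature of the third-party regex package. A minimal sketch using only the standard library follows. It assumes that "non-Latin" here means "anything the latin-1 codec cannot encode" (code points above U+00FF), and the helper name strip_non_latin1 is made up for illustration:

```python
import re

# Matches every character outside the Latin-1 range U+0000..U+00FF.
NON_LATIN1 = re.compile(r'[^\x00-\xff]')

def strip_non_latin1(text: str) -> str:
    """Drop characters the 'latin-1' codec cannot encode (hypothetical helper)."""
    return NON_LATIN1.sub('', text)

print(strip_non_latin1("‘Hi’"))        # Hi
print(strip_non_latin1("‘déjà vu’"))   # déjà vu
```

This is narrower than \p{Latin}: Latin-script letters above U+00FF (Ā, for example) are removed as well, which matches what the latin-1 codec in the error message can accept.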

score 0 · Answer 1 · François M. · 2020-05-29T03:12:53.793

Found a partial way:

input_string = "‘Hi’"

input_string = input_string.encode("latin", "ignore") 
# you can also try "replace" but in the case of ‘ and ’ it doesn't work as it outputs b'?Hi?'

print(input_string)
# b'Hi' is a bytes object, to turn it back into a string, do:

input_string = input_string.decode("utf-8", "ignore") 
# same thing, you can also try "replace", but it might lose you some characters

print(input_string)
# Hi

However, on longer strings, it also strips accented characters (éèàù) so it's not perfect...
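
The accented characters appear to be lost in the decode step rather than the encode step: latin-1 encodes é as the single byte 0xE9, which is not valid UTF-8 on its own, so decode("utf-8", "ignore") silently discards it. A minimal sketch, assuming the aim is to keep everything latin-1 can represent, decodes with the same codec that was used to encode. The helper name to_latin1_text and the punctuation mapping are illustrative additions, not part of the answer above:

```python
# Map common typographic quotes to ASCII so they are not simply dropped (illustrative subset).
PUNCTUATION = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def to_latin1_text(text: str) -> str:
    """Return text restricted to characters latin-1 can encode (hypothetical helper)."""
    text = text.translate(PUNCTUATION)
    # Encode and decode with the SAME codec; "ignore" drops whatever latin-1 cannot represent.
    return text.encode("latin-1", "ignore").decode("latin-1")

print(to_latin1_text("‘Hi’"))               # 'Hi'
print(to_latin1_text("frôlement d’âmes"))   # frôlement d'âmes
```

Translating the curly quotes first keeps the apostrophes readable; anything else outside Latin-1 is still removed by the "ignore" handler.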

edited May 29 '20 at 03:12

answered May 29 '20 at 03:01

François M.


This approach is okay. Alternatively, if you use a font with support for Unicode characters, you will not need to strip these characters from a Unicode string. – Jeffrey Wilges May 29 '20 at 03:11
It's actually not, I lose accented characters. By "if I use a font", you mean in PIL? Can you point me to a (monospaced) font which supports Unicode characters? – François M. May 29 '20 at 03:14
Hi again, based on this comment you might want to edit your question to indicate that you are asking about how to render non-Latin characters. I have provided an answer below to help you work through loading and rendering with a font that supports Unicode characters. – Jeffrey Wilges May 29 '20 at 19:50

score 0 · Answer 2 · answered May 29 '20 at 19:48

Since your comment clarified that your question is about rendering Unicode characters rather than stripping them, here is an example of loading and drawing with a font that supports them.

For this example I used Roboto Mono Bold which you can download from Google Fonts.

from PIL import Image, ImageDraw, ImageFont

haiku = '''bruits de neige et d’encre
frôlement d’âmes et d’ailes
deux papillons s’aiment'''

font = ImageFont.truetype('RobotoMono-Bold.ttf', 18)
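# font.getsize_multiline() was removed in Pillow 10; measure with multiline_textbbox() instead
left, top, right, bottom = ImageDraw.Draw(Image.new('RGB', (1, 1))).multiline_textbbox((0, 0), haiku, font=font)
haiku_dimensions = (right, bottom)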

padding = 4
with Image.new('RGB', tuple(d + (4 * padding) for d in haiku_dimensions), color=(180, 180, 180)) as image:
    draw = ImageDraw.Draw(image)
    draw.multiline_text((padding, padding), haiku, font=font, fill=(20,20,20))
    image.save('output.png')

The output looks like this: [image: the haiku rendered in Roboto Mono Bold]
