Python's .format() minilanguage and Unicode

Question

I'm trying to use some simple Unicode characters in a command-line program I'm writing, but drawing them into a table is difficult because Python appears to treat single-character symbols as multi-character strings.

For example, if I try to print(u"\u2714".encode("utf-8")) I see the unicode checkmark. However, if I try to add some padding to that character (as one might in tabular structure), Python seems to be interpreting this single-character string as a 3-character one. All three of these lines print the same thing:

print("|{:1}|".format(u"\u2714".encode("utf-8")))
print("|{:2}|".format(u"\u2714".encode("utf-8")))
print("|{:3}|".format(u"\u2714".encode("utf-8")))

Now I think I understand why this is happening: it's a multibyte string. My question is, how do I get Python to pad this string appropriately?

I'm currently working with 2.7, but we need to support 3 as well. — Daniel Quinn, Oct 25 '15 at 17:54

score 2 · Answer 1 · answered Oct 25 '15 at 17:55

Make your format strings unicode:

from __future__ import print_function

print(u"|{:1}|".format(u"\u2714"))
print(u"|{:2}|".format(u"\u2714"))
print(u"|{:3}|".format(u"\u2714"))

outputs:

|✔|
|✔ |
|✔  |
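For a table with several columns, the same idea can be sketched as follows. This is a minimal illustration, assuming `unicode_literals` is imported so that one source file behaves the same on 2.7 and 3; the `rows` data and the column widths are made up for the example.

```python
from __future__ import print_function, unicode_literals

# Hypothetical table data: (status symbol, label)
rows = [("\u2714", "build"), ("\u2718", "tests")]

# With unicode_literals the format string is text on both 2.7 and 3,
# so each symbol counts as one character when it is padded.
lines = ["|{:3}|{:8}|".format(symbol, label) for symbol, label in rows]

# Every row comes out the same width.
print([len(line) for line in lines])  # [14, 14]
```

Note that the width is counted in code points, not terminal columns, so characters that a terminal renders double-width (many CJK characters, for instance) can still misalign.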

answered Oct 25 '15 at 17:55

chucksmash

The print function is not required for this to work though. – poke Oct 25 '15 at 17:58
@poke You're correct. OP mentioned in a comment that he was specifically targeting Python 2.7 and 3+ so importing and using `unicode_literals`, `print_function` and `division` are all good practice if not required. – chucksmash Oct 25 '15 at 18:00
I absolutely agree with that :) My comment was more directed at another comment that has been removed since. – poke Oct 25 '15 at 18:03

score 1 · Accepted Answer · answered Oct 25 '15 at 17:54

Don't call encode('utf-8') at that point; do it later:

>>> u"\u2714".encode("utf-8")
'\xe2\x9c\x94'

The UTF-8 encoding is three bytes long. Look at how format works with Unicode strings:

>>> u"|{:1}|".format(u"\u2714")
u'|\u2714|'
>>> u"|{:2}|".format(u"\u2714")
u'|\u2714 |'
>>> u"|{:3}|".format(u"\u2714")
u'|\u2714  |'

Tested on Python 2.7.3.
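A short sketch of why the width was wrong in the question, and of padding first and encoding afterwards. The variable names are illustrative only, and whether you need to encode at all depends on where the output goes.

```python
from __future__ import print_function, unicode_literals

check = "\u2714"                 # one code point (unicode on 2.7, str on 3)
encoded = check.encode("utf-8")  # three bytes: e2 9c 94

print(len(check))    # 1
print(len(encoded))  # 3, so a width of 3 or less adds no padding

# Pad while the value is still text; encode only if bytes are needed.
padded = "|{:3}|".format(check)
padded_bytes = padded.encode("utf-8")

print(len(padded))        # 5: two pipes, the symbol, two spaces
print(len(padded_bytes))  # 7: the symbol alone takes three bytes
```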

answered Oct 25 '15 at 17:54

Dan D.

Exactly what I needed! Thank you. – Daniel Quinn Oct 25 '15 at 18:04
@DanielQuinn: don't encode at all. [Print Unicode directly instead](http://stackoverflow.com/a/31110377/4279). Otherwise, your code may produce a mojibake if the environment uses a different character encoding. – jfs Oct 26 '15 at 10:26
@J.F.Sebastian If I don't encode, Python2.7 explodes with a `UnicodeEncodeError`. If I do, then Python 3 prints out `b'\xe2\x9c\x98'`. – Daniel Quinn Oct 26 '15 at 14:06
@DanielQuinn: If you have issues with printing Unicode then it is a different question (and hard-coding the character encoding is not the answer). Read the link from my previous comment. If you read the linked answer and you have failed to apply the solutions to your case then ask a separate question. – jfs Oct 26 '15 at 14:10
