How to control padding of a Unicode string containing East Asian characters

Question

I have three UTF-8 strings:

hello, world
hello, 世界
hello, 世rld

I want only the first 10 ASCII-character widths of each string, so that the closing brackets line up in one column:

[hello, wor]
[hello, 世 ]
[hello, 世r]

In console:

width('世界')==width('worl')
width('世 ')==width('wor')  #a white space behind '世'

One Chinese character is three bytes in UTF-8, but it is only two ASCII characters wide when displayed in the console:

>>> bytes("hello, 世界", encoding='utf-8')
b'hello, \xe4\xb8\x96\xe7\x95\x8c'

Python's format() doesn't help when such characters are mixed in:

>>> for s in ['[{0:<{1}.{1}}]'.format(s, 10) for s in ['hello, world', 'hello, 世界', 'hello, 世rld']]:
...    print(s)
...
[hello, wor]
[hello, 世界 ]
[hello, 世rl]

It's not pretty:

 -----------Songs-----------
|    1: 蝴蝶                  |
|    2: 心之城                 |
|    3: 支持你的爱人              |
|    4: 根生的种子               |
|    5: 鸽子歌(CUCURRUCUCU PALO|
|    6: 林地之间                |
|    7: 蓝光                  |
|    8: 在你眼里                |
|    9: 肖邦离别曲               |
|   10: 西行( 魔戒王者再临主题曲)(INTO |
| X 11: 深陷爱河                |
| X 12: 钟爱大地(THE MO RUN AIR |
| X 13: 时光流逝                |
| X 14: 卡农                  |
| X 15: 舒伯特小夜曲(SERENADE)    |
| X 16: 甜蜜的摇篮曲(Sweet Lullaby|
 ---------------------------

So I wonder whether there is a standard way to do this kind of padding for UTF-8 text?

You just added the text "One chinese char is three bytes, but it only 2 ascii chars width when displayed in console". As I showed in my answer, the number of *bytes* is irrelevant in determining how wide the character will appear in the font. And the width of a Chinese character cannot be measured in "ASCII characters" -- if you look carefully you'll probably see it's closer to 1.5 or 1.8 ASCII characters, not exactly 2. It is merely a matter of how wide, in pixels, is each character. In Python 3, you should almost never have to deal with the underlying bytes of a string; this is no exception. — mgiuca, Jan 07 '11 at 04:13

score 16 · Accepted Answer · answered Jan 08 '11 at 04:42

When trying to line up ASCII text with Chinese in a fixed-width font, you can use the full-width versions of the printable ASCII characters. Below I build a translation table from ASCII to the full-width versions:

# coding: utf8

# full width versions (SPACE is non-contiguous with ! through ~)
SPACE = '\N{IDEOGRAPHIC SPACE}'
EXCLA = '\N{FULLWIDTH EXCLAMATION MARK}'
TILDE = '\N{FULLWIDTH TILDE}'

# strings of ASCII and full-width characters (same order);
# the +1 makes each range include its last character ('~')
west = ''.join(chr(i) for i in range(ord(' '), ord('~') + 1))
east = SPACE + ''.join(chr(i) for i in range(ord(EXCLA), ord(TILDE) + 1))

# build the translation table
full = str.maketrans(west,east)

data = '''\
蝴蝶(A song)
心之城(Another song)
支持你的爱人(Yet another song)
根生的种子
鸽子歌(Cucurrucucu palo whatever)
林地之间
蓝光
在你眼里
肖邦离别曲
西行（魔戒王者再临主题曲）(Into something)
深陷爱河
钟爱大地
时光流逝
卡农
舒伯特小夜曲(SERENADE)
甜蜜的摇篮曲(Sweet Lullaby)
'''

# Replace the ASCII characters with full width, and create a song list.
data = data.translate(full).rstrip().split('\n')

# translate each printable line.
print(' ----------Songs-----------'.translate(full))
for i,song in enumerate(data):
    line = '|{:4}: {:20.20}|'.format(i+1,song)
    print(line.translate(full))
print(' --------------------------'.translate(full))

Output

　－－－－－－－－－－Ｓｏｎｇｓ－－－－－－－－－－－
｜　　　１：　蝴蝶（Ａ　ｓｏｎｇ）　　　　　　　　　　｜
｜　　　２：　心之城（Ａｎｏｔｈｅｒ　ｓｏｎｇ）　　　｜
｜　　　３：　支持你的爱人（Ｙｅｔ　ａｎｏｔｈｅｒ　ｓ｜
｜　　　４：　根生的种子　　　　　　　　　　　　　　　｜
｜　　　５：　鸽子歌（Ｃｕｃｕｒｒｕｃｕｃｕ　ｐａｌｏ｜
｜　　　６：　林地之间　　　　　　　　　　　　　　　　｜
｜　　　７：　蓝光　　　　　　　　　　　　　　　　　　｜
｜　　　８：　在你眼里　　　　　　　　　　　　　　　　｜
｜　　　９：　肖邦离别曲　　　　　　　　　　　　　　　｜
｜　　１０：　西行（魔戒王者再临主题曲）（Ｉｎｔｏ　ｓ｜
｜　　１１：　深陷爱河　　　　　　　　　　　　　　　　｜
｜　　１２：　钟爱大地　　　　　　　　　　　　　　　　｜
｜　　１３：　时光流逝　　　　　　　　　　　　　　　　｜
｜　　１４：　卡农　　　　　　　　　　　　　　　　　　｜
｜　　１５：　舒伯特小夜曲（ＳＥＲＥＮＡＤＥ）　　　　｜
｜　　１６：　甜蜜的摇篮曲（Ｓｗｅｅｔ　Ｌｕｌｌａｂｙ｜
　－－－－－－－－－－－－－－－－－－－－－－－－－－

It's not overly pretty, but it lines up.
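
As a side note, the same table can be built without spelling out both strings, because the full-width forms U+FF01 to U+FF5E mirror ASCII U+0021 to U+007E at a fixed offset. The following is only a sketch of that alternative, not the code used above; the names FULLWIDTH_OFFSET and to_fullwidth are made up for illustration.

```python
# Sketch: map printable ASCII to its full-width counterpart by code point offset.
# FULLWIDTH_OFFSET and to_fullwidth are illustrative names, not from the answer.
FULLWIDTH_OFFSET = 0xFEE0  # U+FF01 minus U+0021

# '!' (0x21) through '~' (0x7E) shift by the fixed offset.
to_fullwidth = {code: code + FULLWIDTH_OFFSET for code in range(0x21, 0x7F)}
# The space is the exception: it maps to IDEOGRAPHIC SPACE (U+3000).
to_fullwidth[0x20] = 0x3000

# str.translate accepts a dict that maps code points to code points.
print("Sweet Lullaby~".translate(to_fullwidth))
```

Characters that are not in the table, such as the Chinese text itself, pass through translate() unchanged.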

score 6 · Answer 2 · edited May 23 '17 at 12:02

There seems to be no official support for this, but a built-in package may help:

>>> import unicodedata
>>> print(unicodedata.east_asian_width('中'))
W

The returned value represents the category of the code point. Specifically,

W - East Asian Wide
F - East Asian Full-width (of narrow)
Na - East Asian Narrow
H - East Asian Half-width (of wide)
A - East Asian Ambiguous
N - Not East Asian

This answer to a similar question provided a quick solution. Note, however, that the displayed result depends on the exact monospaced font used. The default fonts used by IPython and PyDev don't work well, while the Windows console is OK.
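
A minimal sketch of how these categories could be used to clip and pad to a display width, assuming that W and F characters occupy two columns and everything else occupies one. That assumption ignores combining and zero-width characters and treats the ambiguous category A as narrow. The helper names char_width and fit_to_width are hypothetical.

```python
import unicodedata


def char_width(ch: str) -> int:
    """Assumed column count: 2 for East Asian Wide/Fullwidth, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def fit_to_width(text: str, width: int, fill: str = " ") -> str:
    """Clip text to at most `width` columns, then pad it to exactly `width`."""
    kept = []
    used = 0
    for ch in text:
        w = char_width(ch)
        if used + w > width:
            break  # the next character would overflow the column
        kept.append(ch)
        used += w
    # `fill` is assumed to be a single narrow character.
    return "".join(kept) + fill * (width - used)


for s in ["hello, world", "hello, 世界", "hello, 世rld"]:
    print("[" + fit_to_width(s, 10) + "]")
```

With the three strings from the question this yields 'hello, wor', 'hello, 世 ' and 'hello, 世r'. Whether the brackets actually line up on screen still depends on the font, as noted above.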

score 4 · Answer 3 · answered Jan 07 '11 at 03:54

Take a look at kitchen. I think it might have what you want.

David Johnstone

Precisely! kitchen provides utilities to fill/chop a unicode string to the exact screen char width for monospace font. – Lord Mosuma Jul 11 '17 at 02:23

score 4 · Answer 4 · answered Jan 07 '11 at 04:10

Firstly, it looks like you're using Python 3, so I'll respond accordingly.

Maybe I'm not understanding your question, but it looks like the output you are getting is exactly what you want, except that Chinese characters are wider in your font.

So UTF-8 is a red herring, since we are not talking about bytes, we are talking about characters. You are in Python 3, so all strings are Unicode. The underlying byte representation (where each of those Chinese characters is represented by three bytes) is irrelevant.

You want to clip or pad each string to exactly 10 characters, and that is working correctly:

>>> len('hello, wor')
10
>>> len('hello, 世界 ')
10
>>> len('hello, 世rl')
10

The only problem is that you are looking at it with what appears to be a monospaced font, but which actually isn't. Most monospaced fonts have this problem. All the normal Latin characters have exactly the same width in this font, but the Chinese characters are slightly wider. Therefore, the three characters "世界 " take up more horizontal space than the three characters "wor". There isn't much you can do about this, aside from either a) getting a font which is truly monospaced, or b) calculating precisely how wide each character is in your font, and adding a number of spaces which approximately takes you to the same horizontal position (this will never be accurate).
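
To make the distinction concrete, here is a small sketch comparing three different measurements of the strings above: the character count, the UTF-8 byte count, and an estimate of display columns. The estimate assumes a terminal that draws East Asian Wide/Fullwidth characters in exactly two cells, which is precisely the assumption that may not hold for a given font. The helper name display_columns is hypothetical.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Estimated columns, assuming wide characters take exactly two cells."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


for s in ["hello, wor", "hello, 世界 ", "hello, 世rl"]:
    # characters, UTF-8 bytes, estimated columns
    print(repr(s), len(s), len(s.encode("utf-8")), display_columns(s))
```

All three strings have 10 characters, but under the two-cell assumption they would occupy 10, 12 and 11 columns, and none of those numbers follows from the byte counts (10, 14 and 12).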

Thanks for the explanation. Regarding your comment: "a) getting a font which is truly monospaced", can you recommend one? Is there not a single truly monospaced font that is widely available on all systems? — Sacha Guyer, Aug 14 '18 at 10:04
I don't know of one (it's probably too squashed to fit Chinese characters into the usual Latin space anyway). I think a better solution is that suggested in the accepted answer; translate all the Latin characters to full-width. — mgiuca, Aug 15 '18 at 07:48

score 3 · Answer 5 · answered Dec 20 '19 at 05:47

If you are working with English and Chinese characters, maybe this snippet can help you.

data = '''\
蝴蝶(A song)
心之城(Another song)
支持你的爱人(Yet another song)
根生的种子
鸽子歌(Cucurrucucu palo whatever)
林地之间
蓝光
在你眼里
肖邦离别曲
西行（魔戒王者再临主题曲）(Into something)
深陷爱河
钟爱大地
时光流逝
卡农
舒伯特小夜曲(SERENADE)
甜蜜的摇篮曲(Sweet Lullaby)'''

width = 80

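def get_aligned_string(string,width):
    # pad to `width` characters, then cut the UTF-8 encoding down to `width` bytes;
    # each 3-byte Chinese character thereby pushes two trailing spaces off the end
    string = "{:{width}}".format(string,width=width)
    bts = bytes(string,'utf-8')
    # a cut in the middle of a character shows up as backslash escapes
    string = str(bts[0:width],encoding='utf-8',errors='backslashreplace')
    # a 3-byte character fills 2 columns, so add back half of the removed spaces
    new_width = len(string) + int((width - len(string))/2)
    if new_width!=0:
        string = '{:{width}}'.format(str(string),width=new_width)
    return string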

for i,line in enumerate(data.split('\n')):
    song = get_aligned_string(line,width)
    line = '|{:4}: {:}|'.format(i+1,song)
    print(line)

score 1 · Answer 6 · answered Jan 28 '20 at 07:01

Here is a script based on unicodedata that detects East Asian characters and normalizes strings to NFC form so that half-width/full-width matching is exact. Normalization is required for Korean on macOS, because macOS uses NFD forms, in which each Korean syllable is decomposed into individual jamo (letters) that Python counts as separate characters. For example, "가" is decomposed into two characters and "각" into three, although each should be counted as a single double-width character.

It enumerates all files in the given root_path and displays whether the file names are in NFC or NFD forms.

#! /usr/bin/env python3
import unicodedata
from pathlib import Path


def len_ea(string: str) -> int:
    nfc_string = unicodedata.normalize('NFC', string)
    return sum((2 if unicodedata.east_asian_width(c) in 'WF' else 1) for c in nfc_string)


def align_string(string: str, width: int):
    nfc_string = unicodedata.normalize('NFC', string)
    num_wide_chars = sum(1 for c in nfc_string if unicodedata.east_asian_width(c) in 'WF')
    width = width - num_wide_chars
    return '{:{width}}'.format(nfc_string, width=width)


def show_filename_encodings(root_path: Path):
    outputs = []
    for p in root_path.glob("*"):
        nfc_name = unicodedata.normalize('NFC', p.name)
        nfd_name = unicodedata.normalize('NFD', p.name)
        if p.name == nfc_name:
            enc = "\033[94mNFC\033[0m"
        elif p.name == nfd_name:
            enc = "\033[91mNFD\033[0m"
        else:
            # neither fully NFC nor fully NFD; avoids an unbound or stale `enc`
            enc = "mixed"
        outputs.append((p.name, nfc_name, nfd_name, enc))

    # Take the NFC string to check the maximum length
    colw = max(len_ea(o[1]) for o in outputs) + 2
    for name, nfc_name, nfd_name, enc in outputs:
        print(f"{align_string(nfc_name, colw)}: {enc}")
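
The effect of normalization described above can be checked directly. This is a small standalone sketch, separate from the script; the syllables are written as escapes so that the source file's own normalization form does not matter.

```python
import unicodedata

# U+AC00 is "가" and U+AC01 is "각", both given in precomposed (NFC) form.
for syllable in ["\uac00", "\uac01"]:
    nfd = unicodedata.normalize("NFD", syllable)  # decomposed into jamo
    nfc = unicodedata.normalize("NFC", nfd)       # recomposed into one syllable
    code_points = [f"U+{ord(c):04X}" for c in nfd]
    print(syllable, "NFD length:", len(nfd), "NFC length:", len(nfc), code_points)
```

The NFD lengths are 2 and 3 while the NFC length is 1 in both cases, which is why len_ea and align_string normalize to NFC before counting.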

score 0 · Answer 7 · answered Jul 23 '23 at 06:31

Here's another option that lets you keep the original-width Latin characters, as long as your destination (e.g., a terminal) interprets ANSI escapes and displays double-width characters at twice the width of single-width characters.

It works by using two ANSI escapes: first \x1b[nG to move the cursor horizontally to the absolute column n (e.g., \x1b[10G moves to column 10), then \x1b[K to clear from the cursor to the end of the line.

data = '''\
蝴蝶(A song)
心之城(Another song)
支持你的爱人(Yet another song)
根生的种子
鸽子歌(Cucurrucucu palo whatever)
林地之间
蓝光
在你眼里
肖邦离别曲
西行（魔戒王者再临主题曲）(Into something)
深陷爱河
钟爱大地
时光流逝
卡农
舒伯特小夜曲(SERENADE)
甜蜜的摇篮曲(Sweet Lullaby)
'''

width = 40
title = "Songs"

move_to_column = f"\x1b[{width+2}G"  # +2 for borders
clear_line = "\x1b[K"  # clears from cursor to end of line

print(f" {title:-^{width}}")
for i, line in enumerate(data.splitlines(), 1):
    print(f"|{i:>5}: {line}{move_to_column}{clear_line}|")
print(" " + "-" * width)

rather than a fixed width, you could use the wcwidth package to calculate the maximum width in terms of terminal cells. Then add adjustment for padding, etc. — Andj, Jul 26 '23 at 23:58
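
A rough sketch of that suggestion, with one substitution stated plainly: instead of the third-party wcwidth package it uses the standard library's unicodedata.east_asian_width as a stand-in, which ignores combining and zero-width characters. The names cell_width, inner and move_to_border are made up for illustration, and the song list is shortened.

```python
import unicodedata


def cell_width(text: str) -> int:
    """Approximate terminal cells: 2 for East Asian Wide/Fullwidth, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


songs = ["蝴蝶(A song)", "心之城(Another song)", "卡农"]

# Inner width = 5-character index field + ": " + the widest title.
inner = 5 + 2 + max(cell_width(s) for s in songs)
# +2 accounts for the left border and for columns being counted from 1.
move_to_border = f"\x1b[{inner + 2}G"

print(f" {'Songs':-^{inner}}")
for i, song in enumerate(songs, 1):
    print(f"|{i:>5}: {song}{move_to_border}|")
print(" " + "-" * inner)
```

Because the frame is sized to the widest title, nothing overflows and the clear-to-end-of-line escape is not needed here.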
