9

I have an array containing Japanese characters as well as "normal" ones. How do I align the printout of these?

#!/usr/bin/python
# coding=utf-8

a1=['する', 'します', 'trazan', 'した', 'しました']
a2=['dipsy', 'laa-laa', 'banarne', 'po', 'tinky winky']

for i,j in zip(a1,a2):
    print i.ljust(12),':',j

print '-'*8

for i,j in zip(a1,a2):
    print i,len(i)
    print j,len(j)

Output:

する       : dipsy
します    : laa-laa
trazan       : banarne
した       : po
しました : tinky winky
--------
する 6
dipsy 5
します 9
laa-laa 7
trazan 6
banarne 7
した 6
po 2
しました 12
tinky winky 11

thanks, //Fredrik

Alan Moore
Fredrik Pihl

3 Answers

6

Using the unicodedata.east_asian_width function, keep track of which characters are narrow and wide when computing the length of the string.

#!/usr/bin/python
# coding=utf-8

import sys
import codecs
import unicodedata

out = codecs.getwriter('utf-8')(sys.stdout)

def width(string):
    return sum(1+(unicodedata.east_asian_width(c) in "WF")
        for c in string)

a1=[u'する', u'します', u'trazan', u'した', u'しました']
a2=[u'dipsy', u'laa-laa', u'banarne', u'po', u'tinky winky']

for i,j in zip(a1,a2):
    out.write('%s %s: %s\n' % (i, ' '*(12-width(i)), j))

Outputs:

する          : dipsy
します        : laa-laa
trazan        : banarne
した          : po
しました      : tinky winky

It doesn’t look right in some web browser fonts, but in a terminal window they line up properly.
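For readers on Python 3, where every `str` is already Unicode, the same idea can be sketched roughly as follows. The helper names `display_width` and `ljust_display` are hypothetical and not part of the answer above. The sketch assumes a terminal that draws wide ("W") and fullwidth ("F") characters in two columns, and it does not try to handle zero-width combining characters.

```python
import unicodedata


def display_width(text):
    """Count terminal columns: 2 for wide/fullwidth characters, 1 otherwise."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def ljust_display(text, columns):
    """Left-justify by display columns rather than by character count."""
    return text + " " * max(0, columns - display_width(text))


a1 = ['する', 'します', 'trazan', 'した', 'しました']
a2 = ['dipsy', 'laa-laa', 'banarne', 'po', 'tinky winky']

for left, right in zip(a1, a2):
    print(ljust_display(left, 12), ':', right)
```

Padding by display columns instead of by `len()` is what keeps the colons in one column when the font is monospaced with double-width kana.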

Josh Lee
  • Tab is not a solution; what I'm really doing is generating Sphinx tables containing Japanese verb conjugations. I'll check the east_asian_width function... – Fredrik Pihl Mar 19 '10 at 12:24
  • Perfect, just what I was looking for, in theory at least. Trying to run it, though, gives me this: $ ./try.py Traceback (most recent call last): File "./try.py", line 12, in print i,' '*(12-width(i)),':',j UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256) – Fredrik Pihl Mar 19 '10 at 12:34
  • @Fredrik Ouch, you might need to look at `sys.setdefaultencoding`. http://blog.ianbicking.org/illusive-setdefaultencoding.html – Josh Lee Mar 19 '10 at 12:37
  • Extremely annoying, can't get it to work... >>> import sys >>> sys.getdefaultencoding() 'utf-8' Can you please post the complete code? – Fredrik Pihl Mar 19 '10 at 13:15
  • Ok, I think the correct solution is to not use the default encoding, but to explicitly encode every unicode string into the codec you want. See this question (http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python). OS X appears to have customized this problem away... – Josh Lee Mar 19 '10 at 13:55
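
As a rough Python 3 illustration of the last comment's point (encode explicitly rather than rely on a default), the sketch below reproduces the Latin-1 failure from the traceback and then writes through a wrapper with an explicit UTF-8 encoding. The `io.BytesIO` object is only a stand-in for a real byte stream such as `sys.stdout.buffer`; that substitution is an assumption made so the example is self-contained.

```python
import io

text = 'する'

# Latin-1 has no kana, so encoding fails as in the traceback above.
try:
    text.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Wrap a byte stream with an explicit encoding instead of using the default.
buffer = io.BytesIO()  # stand-in for sys.stdout.buffer
out = io.TextIOWrapper(buffer, encoding='utf-8', newline='\n')
out.write(text + '\n')
out.flush()

data = buffer.getvalue()
print(latin1_ok, data)
```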
2

Use unicode objects instead of byte strings:

#!/usr/bin/python
# coding=utf-8

a1=[u'する', u'します', u'trazan', u'した', u'しました']
a2=[u'dipsy', u'laa-laa', u'banarne', u'po', u'tinky winky']

for i,j in zip(a1,a2):
    print i.ljust(12),':',j

print '-'*8

for i,j in zip(a1,a2):
    print i,len(i)
    print j,len(j)

Unicode objects deal with characters directly.
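
The lengths 6, 9 and 12 in the question's output are UTF-8 byte counts, because each of these kana takes three bytes. A small Python 3 sketch of the difference follows (in Python 3, `str` plays the role of the `unicode` type and `bytes` that of the byte string). Note that `ljust` pads by character count, so this alone fixes `len()` but not the column alignment of double-width characters.

```python
word = 'する'
encoded = word.encode('utf-8')

print(len(word))      # number of characters
print(len(encoded))   # number of UTF-8 bytes

# ljust counts characters, not terminal columns.
padded = word.ljust(12)
print(len(padded))
```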

jcdyer
  • Using u'string' I get UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-1: ordinal not in range(256). Solved by doing a print j.encode('utf-8'), but that seems extremely awkward... – Fredrik Pihl Mar 19 '10 at 12:21
  • @jleedev: My console says otherwise. Can you be more specific? What results are you getting? @Fredrik: Sounds like your terminal wants to use Latin-1 encoding. You'll have to find a way to convince it to use UTF-8, or write your output to a file instead of printing (I recommend `import codecs; f = codecs.open('output.txt', 'w', encoding='utf-8')`). Good luck! – jcdyer Mar 19 '10 at 15:08
  • @jleedev: Ah. I see what's going on. It depends on your font, to some extent, and there's nothing Python can do about that, but it does fix the issue with the character counts in the second `for` loop. – jcdyer Mar 19 '10 at 15:12
1

You need to build the string manually and also compute the format length manually. There is no easy way to do this.

The three functions below do this (needs unicodedata):

shortenStringCJK: shortens a string to a given display width so that it fits in some output (rather than cutting it to a fixed number of characters)

import unicodedata

def shortenStringCJK(string, width, placeholder='..'):
    # get the display length, counting double-width characters as 2
    string_len_cjk = stringLenCJK(str(string))
    # if the display width is too big
    if string_len_cjk > width:
        # set current length and output string
        cur_len = 0
        out_string = ''
        # loop through each character
        for char in str(string):
            # set the current length if we add the character
            cur_len += 2 if unicodedata.east_asian_width(char) in "WF" else 1
            # add the char only while the new length still leaves room for the placeholder
            if cur_len <= (width - len(placeholder)):
                out_string += char
        # return the shortened string with the placeholder appended
        return "{}{}".format(out_string, placeholder)
    else:
        return str(string)

stringLenCJK: get correct length (as in space taken on a terminal)

def stringLenCJK(string):
    # return string len including double count for double width characters
    return sum(1 + (unicodedata.east_asian_width(c) in "WF") for c in string)

formatLen: adjusts the format width to account for double-width characters. Without it, the padded columns will not line up.

def formatLen(string, length):
    # returns the length updated for a string with double-width characters:
    # take the display length (double-width characters count as 2) minus the normal length,
    # then subtract that difference from the original length
    return length - (stringLenCJK(string) - len(string))

To then output some string, predefine the format string:

format_str = "|{{:<{len}}}|"
format_len = 26
string_len = 26

and output as follows (where _string is the string to output)

print("Normal : {}".format(
    format_str.format(
        len=formatLen(shortenStringCJK(_string, width=string_len), format_len))
    ).format(
        shortenStringCJK(_string, width=string_len)
    )
)
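
The reason the width has to be reduced is that `str.format` pads by character count, so every double-width character takes one extra column that the format spec does not know about. A self-contained Python 3 sketch of just that step is below. The snake_case names `string_len_cjk`, `format_len` and `padded_cell` are hypothetical stand-ins for the functions above, and the column width of 12 is an arbitrary choice for illustration.

```python
import unicodedata


def string_len_cjk(text):
    """Display length: double-width characters count as 2 columns."""
    return sum(1 + (unicodedata.east_asian_width(c) in "WF") for c in text)


def format_len(text, length):
    """Reduce the format width by one for each double-width character."""
    return length - (string_len_cjk(text) - len(text))


def padded_cell(text, length):
    """Left-align text in a cell that occupies `length` terminal columns."""
    return "{:<{width}}".format(text, width=format_len(text, length))


for text in ['trazan', 'しました']:
    print("|{}|".format(padded_cell(text, 12)))
```

Both cells occupy 12 columns on a terminal even though the second one contains only 8 characters after padding.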