Unicode formatting

Question

I am working with string formatting. For English the formatting is neat, but for Unicode characters the formatting is haphazard. Can anyone please tell me the reason? Example:

form = u'{:<15}{:<3}({})'
a = [
 u'സി ട്രീമിം',
 u'ബി ഡോഗേറ്റ്',
 u'ജെ ഹോളണ്ട്',
 u'എം നസീർ ',
 u'എം ബസ്ചാഗൻ…',
 u'ടി ഹെഡ് ',
 u'കെ ഭാരത് ',
 u'എം സിറാജ് ',
 u'എ ഈശ്വരൻ ',
 u'സി ഹാൻഡ്‌സ്‌കോംബ് ബി',]

for i in range(0, 10):
    print form.format(a[i][:12], 1, 2)

Gives output as

While

s = [
 u'abcdef',
 u'akash',
 u'rohit',
 u'anubhav',
 u'bhargav',
 u'achut',
 u'punnet',
 u'tom',
 u'rach',
 u'kamal'
 ]
for i in range(0, 10):
    print form.format(s[i][:12], 1, 2)

Gives:

Not all Unicode characters are created equal. Or at least, use equal width. — Martijn Pieters, Oct 22 '18 at 14:04

Martijn Pieters · Answer 1 · 2018-10-22T15:59:28.117

You are printing Malayalam text, a script that uses a lot of vowel signs to modify the preceding letter. These vowel sign codepoints do not form a new letter by themselves, so Malayalam doesn't produce the same regular width of output in a terminal as ASCII letters would.

For example, your first string starts with U+0D38 MALAYALAM LETTER SA and U+0D3F MALAYALAM VOWEL SIGN I. The first, letter SA, takes a full position on the screen, but the second character, the vowel sign I, alters how the letter is printed when it is preceded by SA. Note how with 2 codepoints printed, there is just one visible glyph:

>>> print u'\u0d38'  # letter SA
സ
>>> print u'\u0d3f'  # vowel sign I
 ി
>>> print u'\u0d38\u0d3f'  # both together
സി

The widths of Malayalam codepoints are also different; if you add ASCII letters below SA and vowel sign I, separately and combined, it looks like this:

>>> print u'\u0d38\nA..\n\u0d3f\nB..\n\u0d38\u0d3f\nAB.'  # with ASCII letters for size
സ
A..
 ി
B..
സി
AB.

Note how സ is wider than A (about 2.5 times as wide), while സി is almost as wide as 3 ASCII codepoints in fixed width! Not all Malayalam letters are this wide, however. The next letter in the first example is U+0D1F MALAYALAM LETTER TTA, which is much less wide:

>>> print u'\u0d38\nA..\n\u0d1f\nB..'
സ
A..
ട
B..

In practice, I'm hoping that the difference doesn't matter and codepoints are instead combined such that the output ends up roughly the same width.

Next, Malayalam has other combining characters too; your first string has U+0D4D MALAYALAM SIGN VIRAMA, which has been combined with the preceding letter TTA.

Diacritical marks, when combined with the preceding letter, play havoc with printing width:

>>> print u'\u0d1f\nA..\n\u0d4d\nB..\n\u0d1f\u0d4d\nAB.'
ട
A..
 ്
B..
ട്
AB.

The letter TTA is just as wide as an ASCII letter, and when you add the virama sign, the width didn't actually change.

You can approximate sizes by looking at each codepoint's Unicode general category. The unicodedata.category() function gives you the category as a string:

>>> import unicodedata
>>> unicodedata.category(u'\u0d38')
'Lo'
>>> unicodedata.category(u'\u0d3f')
'Mc'
>>> unicodedata.category(u'\u0d4d')
'Mn'

The letter SA is Lo (Letter, other), the vowel sign is Mc (Mark, spacing combining), and the virama sign is Mn (Mark, nonspacing).

>>> categories = {}
>>> for c in a[0]:
...     cat = unicodedata.category(c)
...     categories[cat] = categories.get(cat, 0) + 1
... 
>>> categories
{'Lo': 4, 'Mn': 1, 'Mc': 4, 'Zs': 1}

So the first string has 4 letters (Lo), 4 spacing combining marks (Mc; the vowel signs and the anusvara sign), and 1 nonspacing mark (Mn; the virama). The Zs category (Separator, space) is for the ' ' ASCII space character.
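To see where each count comes from, the tally can be broken down per codepoint. This is a minimal sketch written for Python 3; the variable `first` is just the first sample string from the question spelled out as escapes, and the categories shown depend on the Unicode database that ships with your Python.

```python
import unicodedata
from collections import Counter

# First sample string from the question, written as escapes so that no
# invisible characters sneak in when copying.
first = u'\u0d38\u0d3f \u0d1f\u0d4d\u0d30\u0d40\u0d2e\u0d3f\u0d02'

# One line per codepoint: number, general category, official name.
for c in first:
    print(u'U+{:04X} {} {}'.format(ord(c), unicodedata.category(c),
                                   unicodedata.name(c)))

# Same tally as the dictionary loop above, using collections.Counter.
counts = Counter(unicodedata.category(c) for c in first)
print(dict(counts))
```

The per-codepoint listing makes it easy to check which marks are spacing (Mc) and which are nonspacing (Mn) before deciding how to count them.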

Can we get their widths predicted better if we skipped Mc and Mn characters? String a[0] would be 5 characters wide (4 times Lo and 1 space):

>>> print a[0] + '\nABCDE.'
സി ട്രീമിം
ABCDE.

In the browser, that doesn't look close enough, but in my iTerm terminal window it looks like this:

To get your lines to line up, you'd have to estimate the display width of each string and then widen the format field by the difference between the number of codepoints and that display width:

import unicodedata

def malayalam_width(s):
    return sum(1 for c in s if unicodedata.category(c)[0] != 'M')

form = u'{:<{width}}{:<3}({})'
for line in a:
    line = line[:12]
    adjust = len(line) - malayalam_width(line)
    print form.format(line, 1, 2, width=15 + adjust)

This improves the output a lot already:

It appears those wider letters do make a difference after all. You'd have to manually add further width for those to get a better result; with a mapping from letter to adjusted width you could get this to align a little better again. However, the codepoint widths are set by the font you use, and I'm not sure how easy it is to find a font that uses equal width for all Malayalam letters.
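A letter-to-width mapping like the one mentioned above could be sketched as follows (Python 3). The names `EXTRA_CELLS`, `adjusted_width` and `pad` are hypothetical, and the single entry in the mapping is a placeholder, not a measured value: the real numbers depend on your font and terminal and would have to be found by experiment.

```python
import unicodedata

# HYPOTHETICAL per-letter corrections: extra terminal cells beyond the first.
# The value below is a placeholder; measure it for your own font.
EXTRA_CELLS = {
    u'\u0d38': 1,  # MALAYALAM LETTER SA, assumed here to need one extra cell
}

def adjusted_width(s, extra=EXTRA_CELLS):
    """Estimate display width: one cell per non-mark codepoint, plus overrides."""
    width = 0
    for c in s:
        if unicodedata.category(c)[0] == 'M':
            continue  # combining marks are assumed to add no width
        width += 1 + extra.get(c, 0)
    return width

def pad(s, target, extra=EXTRA_CELLS):
    """Left-align s in `target` estimated cells by appending spaces."""
    return s + u' ' * max(0, target - adjusted_width(s, extra))

first = u'\u0d38\u0d3f \u0d1f\u0d4d\u0d30\u0d40\u0d2e\u0d3f\u0d02'
print(adjusted_width(first, {}), adjusted_width(first))
print(pad(first, 15) + u'|')
```

With an empty mapping this reduces to the category-based count used earlier, so corrections can be added one letter at a time while you compare the output in your terminal.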

I find it much easier to just use tab stops, using

form = u'{:<{width}}\t{:<3}({})'
for line in a:
    line = line[:12]
    adjust = len(line) - malayalam_width(line)
    print form.format(line, 1, 2, width=12 + adjust)

Now the numbers do line up:

You do need to keep adjusting for widths; otherwise you end up at the wrong tab stop half the time.
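Why the padding still matters with tabs can be shown with a little arithmetic. This sketch assumes tab stops every 8 columns, which is a common terminal default but not guaranteed; `column_after_tab` is a hypothetical helper, and ASCII text stands in for the names so that one character equals one cell.

```python
TAB = 8  # ASSUMED tab-stop interval; check your terminal's setting

def column_after_tab(width, tab=TAB):
    """Column where text lands after a tab that follows `width` cells."""
    return (width // tab + 1) * tab

# Unpadded names of display width 6 and 9 straddle the stop at column 8,
# so the text after the tab ends up in different columns.
print(column_after_tab(6), column_after_tab(9))

# Padded to 12 cells first, both names reach the same stop.
print(column_after_tab(12), column_after_tab(12))

# str.expandtabs applies the same rule to plain ASCII text.
short = ('a' * 6 + '\tx').expandtabs(TAB)
longer = ('a' * 9 + '\tx').expandtabs(TAB)
print(short.index('x'), longer.index('x'))
```

A tab only absorbs width errors smaller than the distance to the next stop, which is why a rough estimate is enough here but no estimate at all is not.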

Caveat: I'm not at all familiar with the Malayalam script, and I'm sure to have missed subtleties about how the various letters, vowel signs and diacritical marks interact. Someone who is more familiar with the script and Unicode codepoints is probably going to be able to produce a better width approximation function than I presented here.

I've also ignored the 2 U+200C ZERO WIDTH NON-JOINER codepoints that are currently present in your last string; you may want to remove those from your data. As its name suggests, it has no width either.
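Stripping those codepoints is a one-line replace. The sketch below (Python 3) uses a short fragment modelled on the last sample string rather than the full string. Note that U+200C has general category Cf, not a mark category, so a count that skips only M categories still treats it as one cell. Removing it can also change how the neighbouring letters are shaped, so compare the rendered result before and after.

```python
import unicodedata

ZWNJ = u'\u200c'  # ZERO WIDTH NON-JOINER

# Fragment modelled on the last sample string: DDA, virama, ZWNJ, SA.
fragment = u'\u0d21\u0d4d\u200c\u0d38'

# Cf is "Other, format"; a filter on the M categories does not skip it.
print(unicodedata.category(ZWNJ))

cleaned = fragment.replace(ZWNJ, u'')
print(len(fragment), len(cleaned))
```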

Thank you for the great explanation – Savitha Suresh Nov 08 '18 at 14:24

score -1 · Answer 2 · answered Oct 22 '18 at 16:24

You could use the wcwidth module. As far as I know, it overcomes issues where tab length is interpreted differently in various terminals.

I used Python 3 here; I take it you're using 2, so your mileage may vary. I also modified the formatting of your output to show some of the variables in use.

Solution

from wcwidth import wcswidth

a = [
    u'സി ട്രീമിം',
    u'ബി ഡോഗേറ്റ്',
    u'ജെ ഹോളണ്ട്',
    u'എം നസീർ ',
    u'എം ബസ്ചാഗൻ…',
    u'ടി ഹെഡ് ',
    u'കെ ഭാരത് ',
    u'എം സിറാജ് ',
    u'എ ഈശ്വരൻ ',
    u'സി ഹാൻഡ്‌സ്‌കോംബ് ബി'
]

desired = 15
max_str = 12

for item in a:

    sub_str = item[:max_str]

    diff = len(sub_str) - wcswidth(sub_str)

    indent = desired + diff if desired - wcswidth(sub_str) > 0 else desired + diff - 1

    form = u'{:<'+ str(indent) +'} {:<3}{:<3}{:<3}'

    print (form.format(sub_str, len(sub_str), wcswidth(sub_str), indent))

Result:

answered Oct 22 '18 at 16:24

Richard Dunn


Thank you for your answer, can you please explain why `else desired + diff - 1` this is to be done? – Savitha Suresh Nov 08 '18 at 14:24
Note: you can nest `{}` sections to specify a width in `str.format()` templates. Don't use string concatenation to build a template here. `u'{:<{indent}} {:<3}{:<3}{indent:<3}'` and `form.format(sub_str, len(sub_str), wcswidth(sub_str), indent=indent)` would work better. – Martijn Pieters Nov 08 '18 at 14:41
@MartijnPieters thanks, couldn't remember how to do that at the time. @Savitha Suresh I can't run it at the moment, but that little bit of math might not actually be necessary, I think just `else desired` should suffice. The general idea is that indentation should be added if there's a gap, otherwise not. – Richard Dunn Nov 08 '18 at 14:46
Note that `wcwidth` is not any better at this than using `unicodedata`. All that `wcswidth` does is give us the exact same info as what we can already glean from [`unicodedata.east_asian_width`](https://docs.python.org/3/library/unicodedata.html#unicodedata.east_asian_width) and [`unicodedata.combining`](https://docs.python.org/3/library/unicodedata.html#unicodedata.combining) (the source code for the functions just replicates the Unicode data table for combining and EAW characters and gives you 0, 1 or 2 for a codepoint based on those tables). – Martijn Pieters Nov 08 '18 at 14:49
The `if desired - wcswidth(sub_str) > 0 else` logic makes no sense here; none of the strings have a longer width anyway, certainly not since there are no EAW codepoints here at all (nothing will take 2 blocks, everything takes 0 or 1 position). The use of `wcwidth` doesn't produce anything helpful here as it actively misses out on combining marks and over-estimates the lengths. I'm not sure how your terminal managed to produce the output you show in the screenshot; on my machine I get very different output even though the `wcswidth()` numbers are exactly the same. – Martijn Pieters Nov 08 '18 at 14:55
Your output actually shows a lot of combining marks not having been combined, so it may just be a case of an invalid Unicode font implementation or outdated Unicode rendering engine (that would be surprising, since Malayalam has been part of Unicode 1.1, released in 1993). – Martijn Pieters Nov 08 '18 at 14:57
At any rate, `wcswidth` is next to useless here, as these are all combining and single-width codepoints but their display output is variable in many fonts. – Martijn Pieters Nov 08 '18 at 14:58
What my iTerm2 OS X terminal shows running your code: https://i.stack.imgur.com/gTMof.png. Note the lack of alignment. – Martijn Pieters Nov 08 '18 at 15:00
When I tested it I was connected to an Amazon Linux machine from Win 10 using MobaXterm. Didn't have anything else at hand to try on. Not sure why it's not working on other terminals, I'm not an expert on wcwidth, but I'll take another look later. – Richard Dunn Nov 08 '18 at 15:08
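The point made in the comments above, that `unicodedata.combining` and `unicodedata.east_asian_width` carry the same kind of information, can be illustrated with a rough sketch (Python 3). The function `approx_cells` is a hypothetical name and is not the `wcwidth` implementation; it only shows what a 0, 1 or 2 cell estimate derived from those two properties looks like. The format call uses the nested `{indent}` field suggested in the comments instead of string concatenation.

```python
import unicodedata

def approx_cells(s):
    """Rough terminal width from Unicode properties alone (an approximation)."""
    cells = 0
    for c in s:
        if unicodedata.combining(c):
            continue  # non-zero combining class: counted as zero cells
        if unicodedata.east_asian_width(c) in ('W', 'F'):
            cells += 2  # wide or fullwidth East Asian character
        else:
            cells += 1
    return cells

first = u'\u0d38\u0d3f \u0d1f\u0d4d\u0d30\u0d40\u0d2e\u0d3f\u0d02'
estimate = approx_cells(first)

# Widen the field by the number of codepoints estimated to take no cell.
indent = 15 + len(first) - estimate
line = u'{:<{indent}} {:<3}{:<3}'.format(first, len(first), estimate,
                                         indent=indent)
print(line)
```

For the first sample string only the virama has a non-zero combining class, so the estimate is 9 cells for 10 codepoints. The vowel signs are still counted as a full cell each, which is the over-estimate described in the comments.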
