How can I get sensible results from len(), str.format() and a zero-width space?

Question

I'm trying to format text as a kind of table and write the result to a file, but the alignment breaks because my source sometimes contains the Unicode character 'ZERO WIDTH SPACE' (\u200b in Python). Consider the following code example:

str_list = ("a\u200b\u200b", "b", "longest entry\u200b")
format_str = "|{string:<{width}}| output of len(): {length}\n"

max_width = 0
for item in str_list:
    if len(item) > max_width:
        max_width = len(item)

with open("tmp", mode='w', encoding="utf-8") as file:
    for item in str_list:
        file.write(format_str.format(string=item,
                                     width=max_width,
                                     length=len(item)))

Content of 'tmp' after running the above script:

|a           | output of len(): 3
|b             | output of len(): 1
|longest entry| output of len(): 14

So this looks like len() does not result in the 'printed width' of the string, and str.format() does not know how to handle zero width characters.

Or, this behavior is intentional and I need to do something else.

To be clear, I'm looking for a way to get something like this result:

|a            | output of len(): 1
|b            | output of len(): 1
|longest entry| output of len(): 13

I'd prefer a solution that doesn't require mangling my source.

_"this looks like len() does not result in the 'printed width' of the string"_. Yes, I believe this is intended behavior. [How do I get the “visible” length of a combining Unicode string in Python?](https://stackoverflow.com/q/33351599/953482) may be of interest to you. — Kevin, Jan 19 '18 at 14:00
The width of the zero width space character depends on the font. You can use a font where non-printing spaces display as a regular space. Or you can change the string: `item = item.replace('\u200b', ' ')` — Håken Lid, Jan 19 '18 at 14:01
@Kevin The accepted answer from your link does not work here. The reported length for the strings matches the result of len() — rhall, Jan 19 '18 at 14:21
@rhall, yeah, I thought that might be the case, since I don't think `ZERO WIDTH SPACE` is a "combining character". What's your opinion on the `wcwidth` project mentioned in the third post? — Kevin, Jan 19 '18 at 14:24
@HåkenLid Ok, so I'm not using a font where non-printing characters are printed, and I assume that not showing them is the intended way of 'displaying' non-printables. I'm trying to avoid replacing/regex solutions since there might be more characters I have to look for. See [zero width unicode chars](http://www.fileformat.info/info/unicode/char/search.htm?q=zero+width&han=Y&preview=entity), and I don't know how many more exist. — rhall, Jan 19 '18 at 14:32
@Kevin That looks promising, thanks for pointing that out. I'll give it a try. — rhall, Jan 19 '18 at 14:43

Zero Piraeus · Accepted Answer · 2018-01-22T16:49:39.743

The wcwidth package has a function wcswidth() which returns the width of a string in character cells:

from wcwidth import wcswidth

length = len('sneaky\u200bPete')      # 11
width = wcswidth('sneaky\u200bPete')  # 10

The difference between wcswidth(s) and len(s) can then be used to correct for the error introduced by str.format(). Modifying your code above:

from wcwidth import wcswidth

str_list = ("a\u200b\u200b", "b", "longest entry\u200b")
format_str = "|{s:<{fmt_width}}| width: {width}, error: {fmt_error}\n"

max_width = max(wcswidth(s) for s in str_list)

with open("tmp", mode='w', encoding="utf-8") as file:
    for s in str_list:
        width = wcswidth(s)
        fmt_error = len(s) - width
        fmt_width = max_width + fmt_error
        file.write(format_str.format(s=s,
                                     fmt_width=fmt_width,
                                     width=width,
                                     fmt_error=fmt_error))

… produces this output:

|a            | width: 1, error: 2
|b            | width: 1, error: 0
|longest entry| width: 13, error: 1

It also produces correct output for strings including double-width characters:

str_list = ("a\u200b\u200b", "b", "㓵", "longest entry\u200b")

|a            | width: 1, error: 2
|b            | width: 1, error: 0
|㓵           | width: 2, error: -1
|longest entry| width: 13, error: 1

Running `wcwidth.wcwidth` on a zero-width non-breaking space (U+FEFF) is giving 1 when it should be 0 (see [issue](https://github.com/jquast/wcwidth/issues/22)), so in my code I had to add a special case to set its width to 0. — wjandrea, Apr 14 '19 at 02:15
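
If installing wcwidth is not an option, or if a character such as U+FEFF needs different handling, a rough standard-library approximation may be enough. The sketch below is not how wcwidth works internally. It assumes that characters in the Unicode categories Cf, Mn and Me occupy no cell and that East Asian wide/fullwidth characters occupy two. The names `cell_width` and `ljust_cells` are hypothetical helpers introduced here for illustration only.

```python
import unicodedata

# Assumption: these general categories (format, nonspacing mark,
# enclosing mark) are treated as taking up no display cell.
ZERO_WIDTH_CATEGORIES = ("Cf", "Mn", "Me")


def cell_width(text):
    """Approximate the number of display cells that `text` occupies."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            continue  # e.g. U+200B ZERO WIDTH SPACE, U+FEFF
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide or fullwidth character
        else:
            width += 1
    return width


def ljust_cells(text, target):
    """Left-justify `text` to `target` cells by appending spaces."""
    return text + " " * max(0, target - cell_width(text))


str_list = ("a\u200b\u200b", "b", "㓵", "\ufeffbom", "longest entry\u200b")
max_width = max(cell_width(s) for s in str_list)

# Build the rows; padding is applied before formatting, so the
# format string itself needs no width correction.
rows = ["|{}| width: {}".format(ljust_cells(s, max_width), cell_width(s))
        for s in str_list]
print("\n".join(rows))
```

Padding the string by hand sidesteps the width correction that str.format() would otherwise need. This is only an approximation: U+00AD SOFT HYPHEN, for instance, is also category Cf, yet some terminals display it, so results may differ from wcwidth for such characters.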
