15

In Python 3, len() on a Unicode string gives you the number of Unicode characters (code points), but I can't figure out how to get the final display width of a string, given that some characters combine.

Genesis 1:1 -- בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ

>>> len('בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ')
60

But the string is only 37 characters wide. Normalization doesn't solve the problem because the vowel points (the dots and dashes attached to the larger characters) remain distinct characters even after normalization.

>>> import unicodedata
>>> len(unicodedata.normalize('NFC', 'בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ'))
60
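
A smaller sketch of the same effect, using escape sequences so the code points are unambiguous. The two sample strings are illustrative choices, not taken from the verse as a whole. As far as I understand it, NFC can merge a Latin letter with its accent because a precomposed character exists, while the precomposed Hebrew presentation forms are excluded from composition, so the Hebrew pair stays as two code points:

```python
import unicodedata

latin = "e\u0301"        # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
hebrew = "\u05d1\u05bc"  # HEBREW LETTER BET + HEBREW POINT DAGESH OR MAPIQ

latin_nfc = unicodedata.normalize("NFC", latin)
hebrew_nfc = unicodedata.normalize("NFC", hebrew)

# NFC composes the Latin pair into one code point but leaves the Hebrew pair alone.
print(len(latin), len(latin_nfc))    # 2 1
print(len(hebrew), len(hebrew_nfc))  # 2 2
```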

As a side note: the textwrap module is totally broken in this regard, aggressively wrapping where it shouldn't. str.format seems similarly broken.

Conley Owens

2 Answers

6

The problem is the combining characters: Python counts each one as a distinct character when computing len(), but they merge with the preceding base character into a single printed character.

To find out whether a character is a combining character, we can use the unicodedata module:

unicodedata.combining(unichr)

Returns the canonical combining class assigned to the Unicode character unichr as integer. Returns 0 if no combining class is defined.
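
To get a feel for what this returns on pointed Hebrew, here is a small sketch that prints the combining class of each code point in the first few characters of the verse. The sample string is written with escapes and is my own transcription of those characters. The exact non-zero values differ from point to point, but for counting purposes only zero versus non-zero matters:

```python
import unicodedata

# bet, sheva, dagesh, resh, tsere, alef
sample = "\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0"

classes = [unicodedata.combining(ch) for ch in sample]

for ch, ccc in zip(sample, classes):
    # Base letters report 0; the points report a non-zero class.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {ccc}")
```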

A naive solution is to just strip out any characters with a non-zero combining class. This leaves characters that stand on their own, and should give us a string with a 1-to-1 mapping between visible and underlying characters. (I am a Unicode novice, and it’s probably more complicated than that. There are subtleties with combining characters and grapheme extenders which I don’t really understand, but don’t seem to matter for this particular string.)

So I came up with this function:

import unicodedata

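def visible_length(unistr):
    '''Returns the number of printed characters in a Unicode string.'''
    # Count only characters whose canonical combining class is 0,
    # i.e. skip combining marks that attach to the preceding character.
    return sum(1 for char in unistr if unicodedata.combining(char) == 0)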

which returns the correct length for your string:

>>> visible_length('בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ')
37

This is probably not a complete solution for all Unicode strings, but depending on what subset of Unicode you’re working with, this may be enough for your needs.
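
If you need something closer to a terminal column count, one possible extension is sketched below. It is only an approximation under my own assumptions, not a full implementation of any Unicode algorithm: the name display_width is made up for this example, it treats marks and format characters (general categories Mn, Me, Cf) as zero width even when their combining class is 0, and it counts East Asian wide and fullwidth characters as two columns.

```python
import unicodedata

def display_width(text):
    """Rough column count: 0 for marks/format chars, 2 for wide chars, else 1."""
    width = 0
    for ch in text:
        # Same test as visible_length: skip anything with a combining class.
        if unicodedata.combining(ch) != 0:
            continue
        # Some zero-width characters have combining class 0 (e.g. ZERO WIDTH
        # JOINER), so also skip by general category. This is an assumption
        # about rendering, not something Unicode guarantees for every font.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Wide (W) and fullwidth (F) characters usually take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

print(display_width("e\u0301"))             # 1
print(display_width("\u65e5\u672c"))        # 4
print(display_width("\u05d1\u05b0\u05bc"))  # 1
```

Actual rendering still depends on the terminal and font, so treat the result as an estimate.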

alexwlchan
  • 3
    If you need the full Unicode grapheme cluster segmentation algorithm or line-splitting then that's a bit more complicated—see third-party modules such as uniseg. – bobince Jun 17 '15 at 10:15
  • +1. This had occurred to me, but when I played around with unicodedata.combining and saw that it returned a wide range of values, I got pretty intimidated, but maybe it's suitable for my purposes. Thanks. Hopefully someone can propose an even more robust solution. – Conley Owens Jun 17 '15 at 15:03
5

A couple of solutions using the third-party uniseg module, as suggested by @bobince:

>>> from uniseg.graphemecluster import grapheme_cluster_breakables
>>> sum(grapheme_cluster_breakables('בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ'))
37
>>>
>>> from uniseg.graphemecluster import grapheme_clusters
>>> list(grapheme_clusters('בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ'))
['בְּ', 'רֵ', 'א', 'שִׁ', 'י', 'ת', ',', ' ', 'בָּ', 'רָ', 'א', ' ', 'אֱ', 'לֹ', 'הִ', 'י', 'ם', ',', ' ', 'אֵ', 'ת', ' ', 'הַ', 'שָּׁ', 'מַ', 'יִ', 'ם', ',', ' ', 'וְ', 'אֵ', 'ת', ' ', 'הָ', 'אָ', 'רֶ', 'ץ']
>>> len(list(grapheme_clusters('בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ')))
37

This looks like the proper way to do it.

Here's an example that patches up textwrap. Solutions for patching up other modules should be similar.

>>> import textwrap
>>> text = 'בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ'
>>> print(textwrap.fill(text, width=40))  # bad, aggressive wrapping
בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת
הַשָּׁמַיִם, וְאֵת הָאָרֶץ
>>> import uniseg.graphemecluster
>>> def new_len(x):
...     if isinstance(x, str):
...         return sum(1 for _ in uniseg.graphemecluster.grapheme_clusters(x))
...     return len(x)
>>> textwrap.len = new_len
>>> print(textwrap.fill(text, width=40))  # Good wrapping
בְּרֵאשִׁית, בָּרָא אֱלֹהִים, אֵת הַשָּׁמַיִם, וְאֵת הָאָרֶץ
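
Assigning to textwrap.len changes wrapping for everything else in the process too. A possible way to limit that is to apply the patch only inside a with block, sketched below. The names patched_len and visible_length are made up for this example, and to keep it runnable without third-party packages it swaps in the simpler combining-class count instead of uniseg; any length function, such as new_len, could be passed in the same way.

```python
import textwrap
import unicodedata
from contextlib import contextmanager

_MISSING = object()

def visible_length(x):
    """Count non-combining characters in strings; defer to len() otherwise."""
    if isinstance(x, str):
        return sum(1 for ch in x if unicodedata.combining(ch) == 0)
    return len(x)

@contextmanager
def patched_len(length_func):
    """Temporarily shadow the built-in len inside the textwrap module."""
    previous = vars(textwrap).get("len", _MISSING)
    textwrap.len = length_func
    try:
        yield
    finally:
        # Restore whatever was there before, or remove the override entirely.
        if previous is _MISSING:
            del textwrap.len
        else:
            textwrap.len = previous

# Two words of five accented letters each: 21 code points, 11 visible characters.
text = "e\u0301" * 5 + " " + "e\u0301" * 5

with patched_len(visible_length):
    wrapped = textwrap.fill(text, width=12)

print(wrapped)                         # fits on one line
print(textwrap.fill(text, width=12))   # unpatched again: wraps onto two lines
```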
Conley Owens
  • 3
    You could also use `regex` module: `count_user_perceived_characters = lambda text: len(regex.findall(r'\X', text))` – jfs Jun 17 '15 at 16:26
  • @J.F.Sebastian Neat! That project says it intends to replace `re`. Do you have any idea if it actually will? – Conley Owens Jun 17 '15 at 18:55