How do I get the "visible" length of a combining Unicode string in Python?

Question

If I have a Python Unicode string that contains combining characters, len reports a value that does not correspond to the number of characters "seen".

For example, if I have a string with combining overlines and underlines such as u'A\u0332\u0305BC', len(u'A\u0332\u0305BC') reports 5; but the displayed string is only 3 characters long.

How do I get the "visible" length (that is, the number of distinct positions occupied by the string the user sees) of a Unicode string containing combining glyphs in Python?

hmm this is interesting, the best I can think of is just stripping the unwanted chars. — postelrich, Oct 26 '15 at 17:22
@riotburn: That will be difficult. The characters could be arbitrary (user-supplied). I'd need to consult a list of what Unicode glyphs are combining — unless that's a systematic part of the encoding. — orome, Oct 26 '15 at 17:25

dawg · Answer 1 · 2015-10-29T21:30:50.313

5

If you have a regex flavor that supports matching grapheme clusters, you can use \X.

While the default Python re module does not support \X, Matthew Barnett's regex module does:

>>> import regex
>>> len(regex.findall(r'\X', u'A\u0332\u0305BC'))
3

On Python 2, the pattern must also be a Unicode string (note the u prefix):

>>> regex.findall(u'\\X', u'A\u0332\u0305BC')
[u'A\u0332\u0305', u'B', u'C']
>>> len(regex.findall(u'\\X', u'A\u0332\u0305BC'))
3

edited Oct 29 '15 at 21:30

answered Oct 26 '15 at 19:18

dawg

score 4 · Accepted Answer · answered Oct 26 '15 at 17:55

The unicodedata module has a function, combining, that returns the canonical combining class of a single character. If it returns 0, the character has no combining class and you can count it as non-combining.

import unicodedata
len(u''.join(ch for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0))

or, slightly simpler:

sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0)
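
To see why comparing against 0 works, it can help to look at what combining returns for each code point in the example string. This is only an illustrative sketch; the helper name combining_classes is made up here and is not part of unicodedata.

```python
import unicodedata

def combining_classes(text):
    """Return (code point, canonical combining class) pairs for text.

    Hypothetical helper, written only to illustrate the answer above.
    """
    # unicodedata.combining() returns 0 for characters without a combining
    # class and a non-zero class number for combining marks.
    return [("U+%04X" % ord(ch), unicodedata.combining(ch)) for ch in text]

for code_point, ccc in combining_classes('A\u0332\u0305BC'):
    print(code_point, ccc)
```

The two marks (U+0332 and U+0305) report non-zero classes, so only A, B and C pass the == 0 test.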

answered Oct 26 '15 at 17:55

Mark Ransom

Or: `sum(not unicodedata.combining(ch) for ch in u'A\u0332\u0305BC')`. – Bakuriu Oct 26 '15 at 19:01
@Bakuriu at first I thought that wouldn't work since `combining` returns integers that aren't `0` or `1`, but `not` takes care of that. Well done! – Mark Ransom Oct 26 '15 at 19:22
This doesn't work for grapheme clusters made from non-marking characters, for example: `u'\u1100\u1161\u11A8'` (각). – 一二三 Oct 26 '15 at 22:04
@一二三 is there something else in `unicodedata` that would handle that case? – Mark Ransom Oct 26 '15 at 22:17
No, `unicodedata` is insufficient for solving this problem as it doesn't expose the "Grapheme_Cluster_Break" property. Libraries like `PyICU` do, but the answer by @dawg is probably the simplest way to get at it. – 一二三 Oct 27 '15 at 00:00
BTW `0` is falsy, so you could use `not` instead of `== 0`, like this `len(u''.join(ch for ch in u'A\u0332\u0305BC' if not unicodedata.combining(ch)))` – wjandrea Feb 20 '19 at 20:14
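
For the Hangul case raised in the comments, one partial workaround that stays within the standard library (not proposed by anyone in this thread) is to normalize to NFC before counting, so that conjoining jamo are composed into a single precomposed syllable. This is a sketch under that assumption, not a grapheme cluster implementation: it only helps where Unicode defines a precomposed form. The name visible_length_nfc is hypothetical.

```python
import unicodedata

def visible_length_nfc(text):
    """Count non-combining characters after NFC normalization.

    Hypothetical helper; an approximation, not full grapheme segmentation.
    """
    # NFC composes canonically equivalent sequences where a precomposed
    # character exists, e.g. three conjoining jamo into one Hangul syllable.
    composed = unicodedata.normalize('NFC', text)
    # `not` maps a combining class of 0 to True and any other class to False.
    return sum(not unicodedata.combining(ch) for ch in composed)

print(visible_length_nfc('\u1100\u1161\u11A8'))  # decomposed Hangul syllable
print(visible_length_nfc('A\u0332\u0305BC'))     # example from the question
```

Sequences with no precomposed equivalent are left as they are by NFC, so the \X approach from the regex module remains the more general answer.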

AXO · Answer 3 · 2016-02-17T05:40:10.050

3

Combining characters are not the only zero-width characters:

>>> import unicodedata
>>> sum(1 for ch in u'\u200c' if unicodedata.combining(ch) == 0)
1

("\u200c" or "‌" is zero-width non-joiner; it's a non-printing character.)

In this case the regex module does not work either:

>>> import regex
>>> len(regex.findall(r'\X', u'\u200c'))
1

I found the wcwidth package, which handles the above case correctly:

>>> from wcwidth import wcswidth
>>> wcswidth(u'A\u0332\u0305BC')
3
>>> wcswidth(u'\u200c')
0

But it still doesn't seem to work with user 596219's example, the decomposed Hangul sequence u'\u1100\u1161\u11A8':

>>> wcswidth('각')
4
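
If installing wcwidth is not an option, the zero-width handling shown above can be roughly approximated with unicodedata alone by skipping characters by general category instead of by combining class. This is a sketch, not what wcwidth does internally: the choice of categories (Mn, Me, Cf) is an assumption, and the names ZERO_WIDTH_CATEGORIES and approx_visible_length are hypothetical.

```python
import unicodedata

# General categories treated as zero-width in this sketch (an assumption,
# not a complete rule): nonspacing marks, enclosing marks, format characters.
ZERO_WIDTH_CATEGORIES = {'Mn', 'Me', 'Cf'}

def approx_visible_length(text):
    """Count characters whose general category is not treated as zero-width."""
    return sum(
        1 for ch in text
        if unicodedata.category(ch) not in ZERO_WIDTH_CATEGORIES
    )

print(approx_visible_length('A\u0332\u0305BC'))  # combining marks are Mn
print(approx_visible_length('\u200c'))           # zero-width non-joiner is Cf
```

Like the combining-based count, this still reports 3 for the decomposed Hangul sequence, because conjoining jamo are ordinary letters (category Lo). It also ignores double-width characters, which wcswidth accounts for.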

edited Feb 17 '16 at 05:40

answered Feb 17 '16 at 02:07

AXO

The regex module has some updates about zero-width matches in Python 3.7, so maybe it will work properly now. I haven't tried it myself. – wjandrea Feb 20 '19 at 15:04
