13

If I have a Python Unicode string that contains combining characters, len reports a value that does not correspond to the number of characters "seen".

For example, if I have a string with combining overlines and underlines such as u'A\u0332\u0305BC', len(u'A\u0332\u0305BC') reports 5; but the displayed string is only 3 characters long.

How do I get the "visible" length of a Unicode string containing combining glyphs in Python, that is, the number of distinct positions occupied by the string the user sees?

orome
  • hmm this is interesting, the best I can think of is just stripping the unwanted chars. – postelrich Oct 26 '15 at 17:22
  • @riotburn: That will be difficult. The characters could be arbitrary (user-supplied). I'd need to consult a list of which Unicode glyphs are combining, unless that's a systematic part of the encoding. – orome Oct 26 '15 at 17:25

3 Answers

5

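If your regex flavor supports matching grapheme clusters, you can use \X, which matches one extended grapheme cluster (a base character together with any combining marks attached to it).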


While the default Python re module does not support \X, Matthew Barnett's regex module does:

>>> import regex
>>> len(regex.findall(r'\X', u'A\u0332\u0305BC'))
3

On Python 2, the pattern itself must also be a Unicode literal (note the u prefix and the escaped backslash):

>>> regex.findall(u'\\X', u'A\u0332\u0305BC')
[u'A\u0332\u0305', u'B', u'C']
>>> len(regex.findall(u'\\X', u'A\u0332\u0305BC'))
3
dawg
4

The unicodedata module has a function, combining(), that returns the canonical combining class of a single character. If it returns 0, you can count the character as non-combining:

import unicodedata
len(u''.join(ch for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0))

or, slightly simpler:

sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0)
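
The one-liner above can be wrapped in a small helper. This is only a sketch: the name visible_length is hypothetical (it is not part of unicodedata), and counting code points whose canonical combining class is 0 is an approximation rather than true grapheme cluster segmentation.

```python
import unicodedata


def visible_length(text):
    """Count code points that are not combining characters.

    Hypothetical helper: approximates the displayed length by skipping
    every code point with a nonzero canonical combining class.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


print(visible_length(u'A\u0332\u0305BC'))  # 3: both combining lines are skipped
print(len(u'A\u0332\u0305BC'))             # 5: len counts every code point
```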
Mark Ransom
  • Or: `sum(not unicodedata.combining(ch) for ch in u'A\u0332\u0305BC')`. – Bakuriu Oct 26 '15 at 19:01
  • @Bakuriu at first I thought that wouldn't work since `combining` returns integers that aren't `0` or `1`, but `not` takes care of that. Well done! – Mark Ransom Oct 26 '15 at 19:22
  • This doesn't work for grapheme clusters made from non-marking characters, for example: `u'\u1100\u1161\u11A8'` (각). – 一二三 Oct 26 '15 at 22:04
  • @一二三 is there something else in `unicodedata` that would handle that case? – Mark Ransom Oct 26 '15 at 22:17
  • No, `unicodedata` is insufficient for solving this problem as it doesn't expose the "Grapheme_Cluster_Break" property. Libraries like `PyICU` do, but the answer by @dawg is probably the simplest way to get at it. – 一二三 Oct 27 '15 at 00:00
  • BTW `0` is falsy, so you could use `not` instead of `== 0`, like this `len(u''.join(ch for ch in u'A\u0332\u0305BC' if not unicodedata.combining(ch)))` – wjandrea Feb 20 '19 at 20:14
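
For the conjoining Hangul jamo example raised in the comments, one partial, stdlib-only workaround is to normalize to NFC before counting, because NFC composes that particular sequence into a single precomposed syllable. This is a sketch that only helps where a precomposed form exists; it is not grapheme cluster segmentation, and the helper name nfc_visible_length is made up for illustration.

```python
import unicodedata


def nfc_visible_length(text):
    """Hypothetical helper: compose with NFC, then skip combining characters."""
    # NFC merges sequences that have a precomposed equivalent,
    # e.g. the three jamo U+1100 U+1161 U+11A8 become U+AC01.
    composed = unicodedata.normalize('NFC', text)
    return sum(1 for ch in composed if unicodedata.combining(ch) == 0)


print(nfc_visible_length(u'\u1100\u1161\u11A8'))  # 1: composed into one syllable
print(nfc_visible_length(u'A\u0332\u0305BC'))     # 3: no precomposed form, marks skipped
```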
3

Combining characters are not the only zero-width characters:

>>> import unicodedata
>>> sum(1 for ch in u'\u200c' if unicodedata.combining(ch) == 0)
1

("\u200c" or "‌" is zero-width non-joiner; it's a non-printing character.)

In this case the regex module does not work either:

>>> import regex
>>> len(regex.findall(r'\X', u'\u200c'))
1

I found the wcwidth package, which handles the above case correctly:

>>> from wcwidth import wcswidth
>>> wcswidth(u'A\u0332\u0305BC')
3
>>> wcswidth(u'\u200c')
0

But it still doesn't seem to work with user 596219's example:

>>> wcswidth('각')
4
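
For the zero-width non-joiner case, a stdlib-only approximation is to skip format characters (general category Cf) along with nonspacing and enclosing marks (Mn, Me). This is a rough sketch, not what wcwidth does internally: the helper name and the choice of categories are assumptions, some Cf characters are visible in certain contexts, and it does not account for double-width characters or for grapheme clusters built from non-marking characters.

```python
import unicodedata

# Assumption: code points in these general categories occupy no position.
ZERO_WIDTH_CATEGORIES = {'Mn', 'Me', 'Cf'}


def approx_visible_length(text):
    """Hypothetical helper: count code points outside the zero-width categories."""
    return sum(1 for ch in text
               if unicodedata.category(ch) not in ZERO_WIDTH_CATEGORIES)


print(approx_visible_length(u'\u200c'))            # 0: ZWNJ has category Cf
print(approx_visible_length(u'A\u0332\u0305BC'))   # 3: both marks have category Mn
```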
AXO
  • The regex module has some updates about zero-width matches in Python 3.7, so maybe it will work properly now. I haven't tried it myself. – wjandrea Feb 20 '19 at 15:04