
I need to know how many displayable characters are in a Unicode string containing Japanese / Chinese characters.

Sample code to make the question clear:

# -*- coding: UTF-8 -*-
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
print len(str)

12

print str

睡眠時間 <<< note that four characters are displayed

How can I know, from the string, that 4 characters are going to be displayed?

knightofni
  • this might be useful http://stackoverflow.com/questions/16528005/find-the-length-of-a-sentence-with-english-words-and-chinese-characters – sundar nataraj Sep 08 '14 at 10:32

2 Answers


This string

str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'

is an encoded representation of Unicode code points. It contains bytes, so len(str) returns the number of bytes.

You want to know how many Unicode code points the string contains. For that, you need to know which encoding was used to encode them. The most popular encoding is UTF-8, in which one code point takes from 1 to 4 bytes (older definitions of UTF-8 allowed sequences of up to 6 bytes). You don't need to remember that, though; just decode the string:

>>> str.decode('utf8')
u'\u7761\u7720\u6642\u9593'

Here you can see 4 Unicode code points. Print it to see the printable version:

>>> print str.decode('utf8')
睡眠時間

And get the number of Unicode code points:

>>> len(str.decode('utf8'))
4
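
For readers on Python 3, where byte strings and text strings are separate types, the same steps might look like the sketch below. It is only an illustration and not part of the original session; the names raw and text are made up for the example.

```python
# Python 3 sketch: the same bytes as in the question, written as a bytes literal.
raw = b'\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'

# len() on a bytes object counts bytes, not characters.
byte_count = len(raw)

# Decoding with the correct encoding yields a str made of Unicode code points.
text = raw.decode('utf-8')
code_point_count = len(text)

# Each of these four CJK characters takes 3 bytes in UTF-8.
bytes_per_char = [len(ch.encode('utf-8')) for ch in text]

print(byte_count, code_point_count, bytes_per_char)
```

The byte count and the code point count differ because every character in this particular string needs three bytes; ASCII text would give the same number for both.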

UPDATE: See also abarnert's answer, which covers cases that counting code points does not handle.

stalk
  • The question asks for the number of displayable characters and not the number of code points – David Heffernan Sep 08 '14 at 10:35
  • It is most definitely not the same. Non printable code points for a start. And then composition, and so on. – David Heffernan Sep 08 '14 at 10:36
  • Ok, there could be some non-printable unicode codes, yes. But in that case it is needed to exclude them explicitly before applying `len`, I suppose. And this will be a more complicated task. – stalk Sep 08 '14 at 10:38
  • Anyway, the asker doesn't seem to care about such issues. One example where your approach works appears to be enough to prove to the asker that it will work always. Never mind. – David Heffernan Sep 08 '14 at 10:41
  • @stalk: There _are_ some non-printable Unicode code points. More importantly, there are some printable Unicode code points that combine with other code points to make what most people normally think of as a single character. (And this is a real issue: e.g., Cocoa generally gives you decomposed strings, which don't match up code-unit-by-code-unit with the composed strings you get for the same characters from some POSIX APIs.) – abarnert Sep 08 '14 at 10:57

If you actually want "displayable characters", you have to do two things.

First, you have to convert the string from UTF-8 to Unicode, as explained by stalk:

s = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
u = s.decode('utf-8')

Next, you have to filter out all code points that don't represent displayable characters. You can use the unicodedata module for this. The category function can give you the general category of any code point. To make sense of these categories, look at the General Categories table in the version of the Unicode Character Database linked from your version of Python's unicodedata docs.

For Python 2.7.8, which uses UCD 5.2.0, you have to do a bit of interpretation to decide what counts as "displayable", because Unicode didn't really have anything corresponding to "displayable". But let's say you've decided that all control, format, private-use, and unassigned characters are not displayable, and everything else is. Then you'd write:

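import unicodedata

def displayable(c):
    # Categories starting with 'C' are the "Other" group (control, format, ...).
    return not unicodedata.category(c).startswith('C')
p = u''.join(c for c in u if displayable(c))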

Or, if you've decided that Mn and Me are also not "displayable" but Mc is:

def displayable(c):
    # Exclude nonspacing and enclosing marks as well as the 'C' categories.
    return unicodedata.category(c) not in {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}
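
To make the counting step concrete, here is a self-contained sketch that applies the second definition to the question's string with a few extra code points appended. The set name HIDDEN, the helper count_displayable, and the test string are illustrative only, and which categories to exclude remains a judgment call.

```python
import unicodedata

# Categories treated as "not displayable" in this sketch (an assumption, adjust to taste).
HIDDEN = {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}

def displayable(c):
    """Return True if the code point's general category is not in HIDDEN."""
    return unicodedata.category(c) not in HIDDEN

def count_displayable(text):
    """Count the code points in text that pass the displayable() filter."""
    return sum(1 for c in text if displayable(c))

sample = u'\u7761\u7720\u6642\u9593'  # the four characters from the question
# Append a newline (Cc), a zero width space (Cf), and 'e' plus a combining acute accent (Mn).
with_extras = sample + u'\n' + u'\u200b' + u'e\u0301'

print(len(with_extras), count_displayable(with_extras))
```

Here len reports 8 code points, while the filter counts 5: the four ideographs and the letter e.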

But even this may not be what you want. For example, does a letter followed by a nonspacing combining mark count as one character or two? The standard example is U+0043 plus U+0327: two code points that make up one character, Ç (but U+00C7 is also that same character in a single code point). Often, just normalizing your string appropriately (which usually means NFKC or NFKD) is enough to solve that, once you know what answer you want. Until you can answer that, of course, nobody can tell you how to do it.
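
As a small sketch of what normalization does to that example, using unicodedata.normalize (the variable names are illustrative). It uses NFC and NFD for simplicity; the compatibility forms behave the same way for this particular pair.

```python
import unicodedata

decomposed = u'C\u0327'   # U+0043 followed by U+0327 COMBINING CEDILLA
precomposed = u'\u00c7'   # U+00C7, the same character as a single code point

# NFC composes the pair into one code point; NFD splits it back apart.
composed = unicodedata.normalize('NFC', decomposed)
split_again = unicodedata.normalize('NFD', precomposed)

print(len(decomposed), len(composed), composed == precomposed)
```

Keep in mind that not every base-plus-mark sequence has a precomposed form, so normalizing does not guarantee one code point per perceived character.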


If you're thinking, "This sucks, there should be an official definition of what 'printable' means, and Python should know that definition", well, they do, you just need to use a newer version of Python. In 3.x, you can just write:

p = ''.join(c for c in u if c.isprintable())

But of course that only works if their definition of "printable" happens to match what you mean by "displayable". And it very well may not—for example, they consider all separators except ' ' non-printable. Obviously they can't include definitions for any distinction anyone might want to make.
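
A quick Python 3 sketch of where str.isprintable draws the line, using a handful of sample code points chosen for illustration:

```python
# Sample code points, keyed by a short description (chosen for illustration).
samples = {
    'space': ' ',                  # ASCII space, the one separator treated as printable
    'no-break space': '\u00a0',    # category Zs
    'zero width space': '\u200b',  # category Cf
    'newline': '\n',               # category Cc
    'CJK ideograph': '\u7761',     # category Lo
    'combining acute': '\u0301',   # category Mn
}

printable = {name: ch.isprintable() for name, ch in samples.items()}

for name, flag in printable.items():
    print(name, flag)
```

Note that the combining mark counts as printable here, so filtering with isprintable alone would still count 'e\u0301' as two characters.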

abarnert