Why does a Python string slice return 11 symbols when 12 are requested?

Question

I use Python 2.7 on OS X 10.9 and would like to cut a Unicode string (05. Чайка.mp3) down to 12 symbols, so I use mp3file[:12]. But the result is a string like 05. Чайка.m, which is only 11 symbols, even though len(mp3file[:12]) returns 12. It looks like the problem is with the Russian symbol й.

What could be wrong here?

The main problem is that I cannot align such strings properly with '{:<12}'.format(mp3file[:12]).

Is the original string prefixed with a magic whitespace codepoint like a BOM? — RichieHindle, Apr 27 '14 at 11:45
@sshashank124, I've updated the question with the exact string. — LA_, Apr 27 '14 at 11:50
@MartijnPieters, it shows `u'05. \u0427\u0430\u0438\u0306\u043a\u0430.m'`. — LA_, Apr 27 '14 at 11:50

score 5 · Accepted Answer · edited May 23 '17 at 10:32


You have unicode text with a combining character:

u'05. \u0427\u0430\u0438\u0306\u043a\u0430.m'

U+0306 is the COMBINING BREVE codepoint, ̆. It combines with the preceding и (CYRILLIC SMALL LETTER I) to form й:

>>> print u'\u0438'
и
>>> print u'\u0438\u0306'
й

You can normalize that to the combined form, U+0439 CYRILLIC SMALL LETTER SHORT I instead:

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0438\u0306')
u'\u0439'

This uses the unicodedata.normalize() function to produce a composed normal form.
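Applied to the file name from the question, this could look like the sketch below. It assumes the name is stored in the decomposed form shown above; the variable names are only illustrative. The idea is to normalize first and slice afterwards, so that the 12 codepoints you keep are also 12 visible characters. This works here because й has a precomposed form; sequences without one stay decomposed.

```python
import unicodedata

# Decomposed file name, as in the question: the short i is stored
# as U+0438 followed by U+0306 (COMBINING BREVE).
mp3file = u'05. \u0427\u0430\u0438\u0306\u043a\u0430.mp3'

# Compose first, so that the pair becomes the single codepoint U+0439.
normalized = unicodedata.normalize('NFC', mp3file)

raw_cut = mp3file[:12]      # 12 codepoints, but only 11 visible characters
nfc_cut = normalized[:12]   # 12 codepoints and 12 visible characters

# Pad the normalized slice; the trailing bar marks the end of the column.
print(u'{:<12}|'.format(nfc_cut))
```

Whether the column actually lines up on screen still depends on the terminal and font, since display width is not the same thing as codepoint count.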

answered Apr 27 '14 at 11:53

Martijn Pieters


Thanks, got it. But how should I display it properly with `{:<12}.format` then? – LA_ Apr 27 '14 at 11:54
@LA_: you can use the normalized form; I added an example. – Martijn Pieters Apr 27 '14 at 11:58
We could do with a canonical answer that addresses issues like this... great answer that I'd more than +1 if I could though :) – Jon Clements Apr 27 '14 at 12:14
[`.normalize()` may fail to combine Unicode codepoints that belong to the same grapheme cluster](http://stackoverflow.com/a/23323520/4279) – jfs Apr 27 '14 at 12:42

score 3 · Answer 2 · edited May 23 '17 at 10:32


A user-perceived character (grapheme cluster) such as й may be built from several Unicode codepoints, and each codepoint in turn may be encoded as several bytes, depending on the character encoding.

Therefore the number of characters you see may be smaller than the length of the Unicode string or byte string that encodes them. You can also truncate inside a Unicode character if you slice a bytestring, or inside a user-perceived character if you slice a Unicode string, even if it is in NFC normalization form. Neither is desirable.
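To make the three levels concrete, here is a small sketch that uses only built-in string operations. The sample is the decomposed й from the question, and UTF-8 is an assumed encoding; other encodings give different byte counts.

```python
text = u'\u0438\u0306'            # one user-perceived character
encoded = text.encode('utf-8')    # the same text as a byte string

codepoints = len(text)            # 2 Unicode codepoints
byte_count = len(encoded)         # 4 bytes: each codepoint takes 2 bytes in UTF-8

# Slicing the Unicode string can split the user-perceived character:
base_only = text[:1]              # u'\u0438', the breve is lost

# Slicing the byte string can split a codepoint, leaving invalid UTF-8:
try:
    encoded[:3].decode('utf-8')
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False
```

So one visible character corresponds to 2 codepoints and 4 bytes here, and a careless slice at either level produces text the user did not intend.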

To count characters properly, you could use the \X regex, which matches an eXtended grapheme cluster (a language-independent "visual character"):

import regex as re # $ pip install regex

characters = re.findall(u'\\X', u'05. \u0427\u0430\u0438\u0306\u043a\u0430.m')
print(characters)
# -> [u'0', u'5', u'.', u' ', u'\u0427', u'\u0430', 
#     u'\u0438\u0306', u'\u043a', u'\u0430', u'.', u'm']

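Notice that, even without normalization, u'\u0438\u0306' is matched as a single character, 'й'.

Normalization alone does not always help: NFC may leave a grapheme cluster as several codepoints, while \X still matches it as one: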

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0646\u200D ') # 3 Unicode codepoints
u'\u0646\u200d ' # still 3 codepoints, NFC hasn't combined them
>>> import regex as re
>>> re.findall(u'\\X', u'\u0646\u200D ') # same 3 codepoints
[u'\u0646\u200d', u' '] # 2 grapheme clusters
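Once the text is split into clusters, truncating without cutting a visible character in half is a matter of slicing the list of clusters instead of the string. The sketch below is a rough standard-library approximation, not the \X rules: the hypothetical helper `simple_clusters` only attaches combining marks to the preceding codepoint, so it would not keep the zero-width joiner example above together.

```python
import unicodedata

def simple_clusters(text):
    # Group each base codepoint with the combining marks that follow it.
    # Approximation only: this is not the full extended grapheme cluster
    # algorithm (no zero-width joiner handling, for example).
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch       # attach the mark to the previous base
        else:
            clusters.append(ch)      # start a new cluster
    return clusters

def truncate_clusters(text, limit):
    # Keep at most `limit` clusters, never splitting one.
    return u''.join(simple_clusters(text)[:limit])

name = u'05. \u0427\u0430\u0438\u0306\u043a\u0430.mp3'
cut = truncate_clusters(name, 12)
# cut holds 12 visible characters stored in 13 codepoints
```

With the regex module installed, replacing the body of `simple_clusters` with `regex.findall(u'\\X', text)` should give the same slicing behaviour with proper cluster detection.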

See also: In Python, how do I most efficiently chunk a UTF-8 string for REST delivery?


answered Apr 27 '14 at 12:39

jfs


Detecting graphemes will not help fitting in a fixed character width when formatting a string, though. Isn't Unicode fun? I can't wait till the OP discovers wide Asian characters! :-P – Martijn Pieters Apr 28 '14 at 07:40
At the very least graphemes tell you how many characters you will see on the screen. :) Width may be context dependent even for ASCII. – jfs Apr 28 '14 at 11:06
How does `\X` handle zero-width spaces or Bidi marks (right-to-left, left-to-right), etc. btw? You demoed with a zero-width joiner here. Just curious; I should probably just install `regex` again and test this myself.. – Martijn Pieters Apr 28 '14 at 11:10
[the first link](http://www.unicode.org/reports/tr29/) in the answer describes all the gory details – jfs Apr 28 '14 at 11:34
I see there are *tailored* grapheme clusters (locale specific, such as `ch` in Slovak, or the `ij` in my Dutch name), but presumably `regex` doesn't support those. No mention in the documentation, at least. – Martijn Pieters Apr 28 '14 at 11:47
@MartijnPieters: \X doesn't match tailored graphemes. – jfs Apr 28 '14 at 22:16
