Normalize composite/decomposable/variable-length characters (unicode/python3.4)

Question

I stumbled upon http://mortoray.com/2013/11/27/the-string-type-is-broken/

And to my horror...

print(len('noe\u0308l')) # returns 5 not 4

However I found https://stackoverflow.com/a/14682498/1267259, Normalizing Unicode

import unicodedata
print(len(unicodedata.normalize('NFC', 'noe\u0308l'))) # returns 4
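A minimal sketch of what the normalization does to the length, assuming only the standard `unicodedata` module (the variable names are illustrative): NFC composes `e` + U+0308 into the single code point U+00EB, while NFD splits it apart again.

```python
import unicodedata

decomposed = 'noe\u0308l'   # 'e' followed by U+0308 COMBINING DIAERESIS
precomposed = 'no\u00ebl'   # the same word using the single code point U+00EB

nfc = unicodedata.normalize('NFC', decomposed)   # compose where possible
nfd = unicodedata.normalize('NFD', precomposed)  # decompose where possible

print(len(nfc))            # 4
print(len(nfd))            # 5
print(nfc == precomposed)  # True
```

Normalizing both sides of a comparison to the same form is what makes the two spellings compare equal.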

But what do I do with the Schrödinger's cats?

print(len('😸😾')) # returns 4 not 2

(side question: in my text editor when I'm trying to save I get a "utf-8 codec can't encode character x in position y: surrogates not allowed" but in the command prompt I can paste and run code with those characters, I assume it is because the cats exist on a different quantum level (SMP) but how do I normalize them?)

Is there anything else I should do to make sure all characters are counted as "1"?

Which specific version of Python 3? Unicode processing has undergone a change or two. — Mark Ransom, Apr 27 '15 at 21:05

Martijn Pieters · Accepted Answer · 2015-04-27T21:29:01.660

Your editor is producing surrogate pairs, not the actual code points, which is why you are also getting that warning. Use:

'\U0001f638\U0001f63e'

to define the cats without resorting to surrogates.

If you do have a string with surrogates, you can recode it via UTF-16, using the 'surrogatepass' error handler so that the surrogates are allowed to be encoded:

>>> # \U0001f638 is \ud83d\ude38 when using UTF-16 surrogates
...
>>> '\ud83d\ude38'.encode('utf16', 'surrogatepass').decode('utf16')
'😸'
>>> len(_)
1

From the Error Handlers documentation:

'surrogatepass'
Allow encoding and decoding of surrogate codes. These codecs normally treat the presence of surrogates as an error.
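The round trip through UTF-16 can be wrapped in a small function. This is only a sketch: `merge_surrogate_pairs` is a hypothetical name, and it assumes the string contains well-formed pairs (a lone, unpaired surrogate would make the decode step raise `UnicodeDecodeError`).

```python
def merge_surrogate_pairs(text):
    """Hypothetical helper: recombine UTF-16 surrogate pairs into real code points."""
    # 'surrogatepass' lets the encoder write the surrogates out as-is;
    # the plain UTF-16 decoder then reads each pair back as one code point.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')

broken = '\ud83d\ude38\ud83d\ude3e'   # two cats spelled as four surrogates
fixed = merge_surrogate_pairs(broken)

print(len(broken))  # 4
print(len(fixed))   # 2
print(fixed == '\U0001f638\U0001f63e')  # True
```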

edited Apr 27 '15 at 21:29

answered Apr 27 '15 at 21:15

Martijn Pieters


In hindsight, that was so obvious. I'm leaving my answer up in case it's useful to a passerby later. – Mark Ransom Apr 27 '15 at 21:20
Thanks, that did take care of the error message, unfortunately no more cute cats... Although `len('\U0001f638\U0001f63e')` returns 2 now. But how do I ensure that actual code points are always used or surrogate pairs are never used? Do I need to "normalize" the string somehow? Do I have to do as Mark R suggested? – user1267259 Apr 27 '15 at 21:30
@user1267259: what is your *normal* source of data input? If they are string literals in the source, I'd use `\Uhhhhhhhh` escape sequences, as it saves a lot of headaches with editors and encoding configurations. When reading from a file or network connection, use the right codec; you should not normally end up with surrogates in your Unicode string values, only in the encoded bytes (where they belong). – Martijn Pieters Apr 27 '15 at 21:34
Ahh, I see. My normal source is a bunch of text files. I've stripped out non-Latin characters but had a feeling composite characters could cause problems, so I googled until I found the mentioned article. The variable-length characters were new to me, and in my panic I assumed they too could be considered "decomposed" somehow (given the result of the length). It didn't help that I copied and pasted those cats :) But if I understand you: as long as I load/decode those files correctly there shouldn't be any problems with surrogates. I only need to unicodedata.normalize (sorry if I'm being redundant). – user1267259 Apr 27 '15 at 21:49
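The workflow discussed in these comments (decode the file with the right codec, then normalize) might look roughly like this. `read_normalized` is a hypothetical helper, and UTF-8 is an assumption about how the files are encoded; the temporary file only exists to make the sketch runnable.

```python
import os
import tempfile
import unicodedata

def read_normalized(path, encoding='utf-8'):
    """Hypothetical helper: decode with an explicit codec, then apply NFC."""
    with open(path, encoding=encoding) as fh:
        return unicodedata.normalize('NFC', fh.read())

# Demo: write a decomposed "noël" plus one cat, then read it back.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('noe\u0308l \U0001f638')
    path = tmp.name

text = read_normalized(path)
os.remove(path)

print(len(text))  # 6: four letters, a space, and one cat
```

Because the decoder produces real code points, no surrogates appear in `text`, and NFC takes care of the combining diaeresis.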

score 0 · Answer 2 · answered Apr 27 '15 at 21:12

For consistent codepoint counting on any version of Python, encode to UTF-32 and divide the byte count by 4.

import unicodedata
print(len(unicodedata.normalize('NFC', 'noe\u0308l').encode('utf-32le')) // 4)
print(len('\U0001f638\U0001f63e'.encode('utf-32le')) // 4)
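On Python 3.3 and later, strings are indexed by code point, so the UTF-32 count and plain `len()` should agree. A small sketch to check this (`count_code_points` is a hypothetical name):

```python
def count_code_points(text):
    """Hypothetical helper: count code points via the UTF-32 byte length."""
    # Every code point occupies exactly 4 bytes in UTF-32 (no BOM with the -le variant).
    return len(text.encode('utf-32-le')) // 4

cats = '\U0001f638\U0001f63e'
print(count_code_points(cats))  # 2
print(len(cats))                # 2
```

Note that the UTF-32 encoder rejects strings that still contain surrogates, so this only applies once real code points are in place.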

answered Apr 27 '15 at 21:12

Mark Ransom


But the OP is using Python 3.4. This is not a polyglot issue. – Martijn Pieters Apr 27 '15 at 21:14
@MartijnPieters that information wasn't added until I left my answer. If you have a better one, go for it, because at this point I'm mystified. – Mark Ransom Apr 27 '15 at 21:14
