I stumbled upon http://mortoray.com/2013/11/27/the-string-type-is-broken/

And to my horror...

print(len('noe\u0308l')) # returns 5 not 4

However, I found https://stackoverflow.com/a/14682498/1267259, which covers normalizing Unicode:

import unicodedata
print(len(unicodedata.normalize('NFC', 'noe\u0308l'))) # returns 4

But what do I do with the Schrödinger's cats?

print(len('😸😾')) # returns 4 not 2

(Side question: when I try to save in my text editor I get "utf-8 codec can't encode character x in position y: surrogates not allowed", but in the command prompt I can paste and run code with those characters. I assume it is because the cats exist on a different quantum level (SMP), but how do I normalize them?)

Is there anything else I should do to make sure all characters are counted as "1"?

user1267259

2 Answers

Your editor is producing surrogate pairs, not the actual code points, which is why you are also getting that warning. Use:

'\U0001f638\U0001f63e'

to define the cats without resorting to surrogates.

If you do have a string with surrogates, you can recode these via UTF-16, allowing the surrogates to be encoded with the 'surrogatepass' error handler:

>>> # \U0001f638 is \ud83d\ude38 when using UTF-16 surrogates
...
>>> '\ud83d\ude38'.encode('utf16', 'surrogatepass').decode('utf16')
'😸'
>>> len(_)
1

From the Error Handlers documentation:

'surrogatepass'
Allow encoding and decoding of surrogate codes. These codecs normally treat the presence of surrogates as an error.
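
The UTF-16 round trip above can be wrapped in a small helper. This is only a sketch: the name `merge_surrogate_pairs` is made up for illustration, and it assumes every surrogate in the input is part of a valid high/low pair (a lone surrogate would still raise an error on decoding).

```python
def merge_surrogate_pairs(text):
    """Recombine UTF-16 surrogate pairs in a str into real code points."""
    # 'surrogatepass' lets the UTF-16 codec encode the surrogates as-is;
    # decoding the resulting bytes then joins each high/low pair.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


cats = '\ud83d\ude38\ud83d\ude3e'  # the two cats written as surrogate pairs
fixed = merge_surrogate_pairs(cats)

print(len(cats), len(fixed))            # 4 2
print(fixed == '\U0001f638\U0001f63e')  # True
```

Strings that contain no surrogates pass through unchanged, so the helper should be safe to apply to text of unknown origin as long as any surrogates in it are properly paired.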

Martijn Pieters
  • In hindsight, that was so obvious. I'm leaving my answer up in case it's useful to a passerby later. – Mark Ransom Apr 27 '15 at 21:20
  • Thanks, that did take care of the error message, unfortunately no more cute cats... Although `len('\U0001f638\U0001f63e')` returns 2 now. But how do I ensure that actual code points are always used or surrogate pairs are never used? Do I need to "normalize" the string somehow? Do I have to do as Mark R suggested? – user1267259 Apr 27 '15 at 21:30
  • @user1267259: what is your *normal* source of data input? If they are string literals in the source, I'd use `\Uhhhhhhhh` escape sequences, as it saves a lot of headaches with editors and encoding configurations. When reading from a file or network connection, use the right codec; you should not normally end up with surrogates in your Unicode string values, only in the encoded bytes (where they belong). – Martijn Pieters Apr 27 '15 at 21:34
  • Ahh, I see. My normal source is a bunch of text files. I've stripped out non-Latin characters but had a feeling composite characters could cause problems, so I googled until I found the mentioned article. The variable-length characters were new to me, and in my panic I assumed they too could be considered "decomposed" somehow (given the result of the length). It didn't help that I copied and pasted those cats :) But if I understand you: as long as I load/decode those files correctly there shouldn't be any problems with surrogates. I only need to unicodedata.normalize (sorry if I'm being redundant). – user1267259 Apr 27 '15 at 21:49
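
The workflow discussed in the comments above (decode the file with the right codec, then normalize) might look roughly like the following. It is a sketch under assumptions: the files are taken to be UTF-8, and the temporary file and its contents are invented here purely so the example runs on its own.

```python
import os
import tempfile
import unicodedata

# Hypothetical sample file: "noël" with a decomposed ë, a space and one cat,
# stored as UTF-8 bytes.
sample = 'noe\u0308l \U0001f638'
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(sample.encode('utf-8'))

# Decode with the codec the file was written in, then normalize to NFC.
with open(path, encoding='utf-8') as f:
    text = unicodedata.normalize('NFC', f.read())
os.remove(path)

print(len(text))  # 6: n, o, ë, l, space, cat
```

Because the bytes are decoded by the codec rather than pasted through an editor, the cat arrives as a single code point and no surrogates appear in the string.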
For consistent codepoint counting on any version of Python, encode to UTF-32 and divide the byte count by 4.

import unicodedata
print(len(unicodedata.normalize('NFC', 'noe\u0308l').encode('utf-32le')) // 4)
print(len('\U0001f638\U0001f63e'.encode('utf-32le')) // 4)
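
The same idea can be packaged as a reusable function. This is a minimal sketch: the name `count_code_points` is hypothetical, and it uses floor division so the result is an integer on Python 3 as well.

```python
import unicodedata


def count_code_points(text):
    """Count code points by encoding to UTF-32 (4 bytes each, no BOM)."""
    # The explicit little-endian codec does not prepend a byte order mark,
    # so the byte length is exactly 4 times the number of code points.
    return len(text.encode('utf-32-le')) // 4


composed = unicodedata.normalize('NFC', 'noe\u0308l')
cats = '\U0001f638\U0001f63e'

print(count_code_points(composed))  # 4
print(count_code_points(cats))      # 2
```

On Python 3 this matches `len()` for strings without surrogates; the UTF-32 route mainly matters where `len()` counts UTF-16 code units instead.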
Mark Ransom