I need to step through a Python string one character at a time, but a simple "for" loop gives me UTF-16 code units instead:
str = "abc\u20ac\U00010302\U0010fffd"
for ch in str:
code = ord(ch)
print("U+{:04X}".format(code))
That prints:
U+0061
U+0062
U+0063
U+20AC
U+D800
U+DF02
U+DBFF
U+DFFD
when what I wanted was:
U+0061
U+0062
U+0063
U+20AC
U+10302
U+10FFFD
Is there any way to get Python to give me the sequence of Unicode code points, regardless of how the string is actually encoded under the hood? I'm testing on Windows here, but I need code that will work anywhere. It only needs to work on Python 3, I don't care about Python 2.x.
The best I've been able to come up with so far is this:
import codecs
str = "abc\u20ac\U00010302\U0010fffd"
bytestr, _ = codecs.getencoder("utf_32_be")(str)
for i in range(0, len(bytestr), 4):
code = 0
for b in bytestr[i:i + 4]:
code = (code << 8) + b
print("U+{:04X}".format(code))
But I'm hoping there's a simpler way.
(Pedantic nitpicking over precise Unicode terminology will be ruthlessly beaten over the head with a clue-by-four. I think I've made it clear what I'm after here, please don't waste space with "but UTF-16 is technically Unicode too" kind of arguments.)