I'm using chr() to run through a list of Unicode characters, but whenever it comes across a character that is unassigned, it just keeps running and doesn't raise an error or anything. How do I check whether the output of chr() will be undefined?

For example,

print(chr(55396))

is in the range of Unicode; it's just an unassigned character. How do I check whether chr() will give me an actual character, so that this hang-up doesn't occur?

VintiumDust
  • That throws a `UnicodeEncodeError` for me. So, try catching that? – grooveplex Feb 27 '19 at 23:51
  • @grooveplex: You see that error only because you tried to `print` it (which tries to encode it in your system locale encoding); the character itself is created without error. – ShadowRanger Feb 27 '19 at 23:54
  • I was just following the example that OP gave, so if that's their actual code, they can try doing that. – grooveplex Feb 27 '19 at 23:55
  • I don't think it's unassigned. It's just a high surrogate without a corresponding low surrogate. – ShadowRanger Feb 27 '19 at 23:58
  • @grooveplex: The encoding error is a sort of backwards way of detecting this. The problem is they've made a high surrogate, which makes no sense without a paired low surrogate (and which really only makes sense if you're trying to decompose the string into UTF-16 form). It's a legal character, if paired properly, but in isolation it's gibberish. – ShadowRanger Feb 28 '19 at 00:04
  • @ShadowRanger Didn't know that. Thanks! – grooveplex Feb 28 '19 at 00:04
  • What do you consider to be "undefined"? Apparently you don't want to allow surrogate code points. What about [private-use characters and "noncharacters"](http://www.unicode.org/faq/private_use.html)? – user2357112 Feb 28 '19 at 00:45
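
To make the surrogate point from the comments concrete, here is a minimal sketch. The helper names `is_surrogate` and `can_encode_utf8` are made up for illustration and are not part of any library. It shows that `chr()` itself succeeds for 55396 (0xD864), and that the problem only shows up when the string has to be encoded.

```python
def is_surrogate(code_point):
    """Return True if code_point lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def can_encode_utf8(text):
    """Return True if text can be encoded as UTF-8 (lone surrogates cannot)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


ch = chr(55396)  # no error here: a Python str can hold a lone surrogate
print(is_surrogate(55396))   # range check on the code point itself
print(can_encode_utf8(ch))   # encoding is where the lone surrogate is rejected
```

The range check is probably the more direct test; relying on the encode error works too, but as noted above it is a roundabout way of detecting the same thing.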

1 Answer

You could use the unicodedata module:

>>> import unicodedata
>>> unicodedata.name(chr(55396))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> unicodedata.name(chr(120))
'LATIN SMALL LETTER X'
>>>
Ned Batchelder
  • Flaw: `unicodedata.name(chr(0))` (along with tons of non-printing, but valid ASCII) has no name in the database. I was going to suggest this, but I don't think a solution that can't distinguish invalid characters from unnamed characters is viable in most cases. – ShadowRanger Feb 28 '19 at 00:06
  • I guess it depends on what the OP needs. – Ned Batchelder Feb 28 '19 at 00:07
  • Perhaps. I was just checking though, and some of the unnamed characters will occur in real text. All ordinals below 32 are unnamed, including newlines, tabs, and carriage returns. I don't think you want to be told that `chr(10)` (aka `'\n'`) is an invalid character. – ShadowRanger Feb 28 '19 at 00:12
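
One possible way around the unnamed-control-character flaw raised in these comments is to look at the general category instead of the name. This is only a sketch, not something proposed in the thread: `is_assigned` and its `allow_private_use` flag are hypothetical names, and which categories count as "undefined" is an assumption that depends on what the OP actually needs. `unicodedata.category()` returns `'Cn'` for unassigned code points and noncharacters, `'Cs'` for surrogates, `'Co'` for private-use characters, and `'Cc'` for control characters such as `'\n'`.

```python
import unicodedata


def is_assigned(code_point, allow_private_use=True):
    """Return True unless code_point is unassigned ('Cn') or a surrogate ('Cs').

    Private-use characters ('Co') are accepted unless allow_private_use is False.
    Control characters ('Cc'), such as newline and tab, are accepted.
    """
    category = unicodedata.category(chr(code_point))
    if category in ("Cn", "Cs"):
        return False
    if category == "Co" and not allow_private_use:
        return False
    return True


for cp in (55396, 10, 120, 0xE000, 0xFFFF):
    print(hex(cp), unicodedata.category(chr(cp)), is_assigned(cp))
```

Keep in mind that the result for a given code point depends on the Unicode version bundled with your Python (see `unicodedata.unidata_version`), so a character that is unassigned today may be reported as assigned by a newer interpreter.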