How to check if a chr()'s output will be undefined

Question

I'm using chr() to run through a list of Unicode characters, but whenever it comes across a character that is unassigned, it just keeps running and doesn't raise an error or anything. How do I check if the output of chr() will be undefined?

For example,

print(chr(55396))

is in range of Unicode; it's just an unassigned character. How do I check whether chr() will give me an actual character, so that this hangup doesn't occur?

That throws a `UnicodeEncodeError` for me. So, try catching that? — grooveplex, Feb 27 '19 at 23:51
@grooveplex: You see that error only because you tried to `print` it (which tries to encode it in your system locale encoding); the character itself is created without error. — ShadowRanger, Feb 27 '19 at 23:54
I was just following the example that OP gave, so if that's their actual code, they can try doing that. — grooveplex, Feb 27 '19 at 23:55
I don't think it's unassigned. It's just a high surrogate without a corresponding low surrogate. — ShadowRanger, Feb 27 '19 at 23:58
@grooveplex: The encoding error is a sort of backwards way of detecting this. The problem is they've made a high surrogate, which makes no sense without a paired low surrogate (and which really only makes sense if you're trying to decompose the string into UTF-16 form). It's a legal character, if paired properly, but in isolation it's gibberish. — ShadowRanger, Feb 28 '19 at 00:04
What do you consider to be "undefined"? Apparently you don't want to allow surrogate code points. What about [private-use characters and "noncharacters"](http://www.unicode.org/faq/private_use.html)? — user2357112, Feb 28 '19 at 00:45
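
To illustrate the point made in the comments above, here is a minimal sketch showing that `chr(55396)` succeeds and only fails once it is encoded, plus a plain range check for surrogates. The helper names `is_surrogate` and `can_encode_utf8` are hypothetical, made up for this example; the range U+D800 to U+DFFF is the UTF-16 surrogate block.

```python
def is_surrogate(code_point: int) -> bool:
    """Return True if code_point lies in the UTF-16 surrogate range."""
    return 0xD800 <= code_point <= 0xDFFF


def can_encode_utf8(code_point: int) -> bool:
    """Return True if chr(code_point) can be encoded as UTF-8."""
    try:
        chr(code_point).encode("utf-8")
    except UnicodeEncodeError:
        # Lone surrogates are rejected by the UTF-8 codec.
        return False
    return True


ch = chr(55396)   # no error here: chr() itself does not validate surrogates
print(hex(55396))                                    # 0xd864
print(is_surrogate(55396), can_encode_utf8(55396))   # True False
print(is_surrogate(120), can_encode_utf8(120))       # False True
```

Note that this only catches surrogates. Whether private-use characters or noncharacters should also count as "undefined" is a separate decision.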

score 3 · Answer 1 · answered Feb 27 '19 at 23:58

You could use the unicodedata module:

>>> import unicodedata
>>> unicodedata.name(chr(55396))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
>>> unicodedata.name(chr(120))
'LATIN SMALL LETTER X'
>>>
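
If you would rather not catch the `ValueError` inside a loop, `unicodedata.name()` also accepts a default value as its second argument. A rough sketch of that variant follows; `has_name` is a hypothetical helper name, not part of the module.

```python
import unicodedata


def has_name(code_point: int) -> bool:
    """Return True if the character has a name in the Unicode database."""
    # The second argument is returned instead of raising ValueError.
    return unicodedata.name(chr(code_point), "") != ""


for cp in (120, 55396, 10):
    print(cp, has_name(cp))
# 120 True
# 55396 False
# 10 False
```

Keep in mind that control characters such as `chr(10)` have no name either, so this check alone reports them the same way as the surrogate.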

answered Feb 27 '19 at 23:58

Ned Batchelder


Flaw: `unicodedata.name(chr(0))` (along with tons of non-printing, but valid ASCII) has no name in the database. I was going to suggest this, but I don't think a solution that can't distinguish invalid characters from unnamed characters is viable in most cases. – ShadowRanger Feb 28 '19 at 00:06
I guess it depends on what the OP needs. – Ned Batchelder Feb 28 '19 at 00:07
Perhaps. I was just checking though, and some of the unnamed characters will occur in real text. All ordinals below 32 are unnamed, including newlines, tabs, and carriage returns. I don't think you want to be told that `chr(10)` (aka `'\n'`) is an invalid character. – ShadowRanger Feb 28 '19 at 00:12
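
One possible way around the unnamed-control-character problem is to look at the general category instead of the name. This is only a sketch under an assumption: that "undefined" means surrogates (`Cs`) and unassigned code points or noncharacters (`Cn`). The names `REJECTED` and `is_usable` are hypothetical; add `"Co"` to the set if private-use characters should be rejected as well.

```python
import unicodedata

# Assumption: these general categories count as "undefined" for the OP.
# Cs = surrogate, Cn = unassigned (this also covers noncharacters).
REJECTED = {"Cs", "Cn"}


def is_usable(code_point: int) -> bool:
    """Return True if the code point's general category is not rejected."""
    return unicodedata.category(chr(code_point)) not in REJECTED


print(unicodedata.category(chr(55396)), is_usable(55396))  # Cs False
print(unicodedata.category(chr(10)), is_usable(10))        # Cc True
print(unicodedata.category(chr(120)), is_usable(120))      # Ll True
```

With this approach `chr(10)` is accepted, because it is a control character (`Cc`) rather than an unassigned one. The result for recently assigned code points depends on the Unicode version bundled with your Python (see `unicodedata.unidata_version`).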
