I'm working on a program that should reject any code point above U+10FFFF. This seems straightforward enough, except I can't figure out how to represent such a range of code points in my regular expression. I want to do something like this:
valid_character = re.compile(u'[\u0000-\u10FFFF]')
and then have anything that doesn't match be handled appropriately. However, \u only seems to recognize the first four hex digits, namely 10FF. Is there another way to represent this code point range or handle this situation?
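To illustrate what seems to be happening, here is a minimal sketch (the variable name s is only for illustration). It assumes that a \u escape in a unicode literal always consumes exactly four hex digits, so the trailing FF is read as two ordinary characters:

```python
# \u takes exactly four hex digits, so u'\u10FFFF' is U+10FF followed by 'F', 'F'.
s = u'\u10FFFF'

print(len(s))                  # 3
print(hex(ord(s[0])))          # 0x10ff
print(s == u'\u10FF' + u'FF')  # True
```

If that reading is right, the character class above would cover only U+0000 through U+10FF plus the letter F, not the full range I intended.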
This site recommends u"\U0010FFFF", but that doesn't seem to work either:
>>> ord(u'\U0010FFFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
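The "string of length 2" part suggests that this interpreter may be a narrow Unicode build, where code points above U+FFFF are stored as a UTF-16 surrogate pair. The following is a rough sketch for checking that, not a definitive diagnosis; the names c, high, low and code_point are only illustrative, and it assumes sys.maxunicode reports 0xFFFF on narrow builds and 0x10FFFF otherwise:

```python
import sys

# 0xffff would indicate a narrow build; 0x10ffff a wide build (or Python 3.3+).
print(hex(sys.maxunicode))

c = u'\U0010FFFF'
if sys.maxunicode == 0xFFFF:
    # Narrow build: the literal is stored as two surrogate code units,
    # which is why ord() complains about a string of length 2.
    high, low = ord(c[0]), ord(c[1])
    code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
else:
    # Wide build: the literal is a single character.
    code_point = ord(c)

print(hex(code_point))
```

On a narrow build, a single-character class range ending at \U0010FFFF presumably cannot work as written, since the regex engine would see the two surrogates separately.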