I'm working on a program that should reject any code point above U+10FFFF. This seems straightforward enough, except I can't figure out how to represent such a range of code points in my regular expression. I want to do something like this:

valid_character = re.compile(u'[\u0000-\u10FFFF]')

and then have anything that doesn't match be handled appropriately. However, \u only seems to recognize the first four hex digits, namely 10FF. Is there another way to represent this code point range or handle this situation?

This site recommends u"\U0010FFFF" but that doesn't seem to work either.

>>> ord(u'\U0010FFFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
Dan Oberlam
  • What does your input look like? Python should, by definition, reject any Unicode "character" above U+10ffff, since they do not exist. – chepner Dec 27 '14 at 16:50
  • @chepner I'll be getting some input file and (among other things) I have to strip out characters that do not fall in this range – Dan Oberlam Dec 27 '14 at 16:52
  • It can't be specified with the `\u` or `\U` syntax, since characters above `U+10FFFF` are not valid Unicode. What is the encoding of your file? Provide a sample with the characters you need to filter. – Mark Tolonen Dec 27 '14 at 16:56
  • Isn't it possible to synthesize a character code that is technically valid but not valid for (the current version of) Unicode? – Jongware Dec 27 '14 at 16:58
  • @MarkTolonen unfortunately I don't have a good sample of a file; this is just part of a preprocessing routine I was asked to write and I was told that it was required. As far as I know there is no reason for us to expect such an input or any encoding besides UTF-8. That is sort of why I'm asking - it seems nonsensical to me. If it truly is nonsensical then I'm fine with going back to my boss with that. – Dan Oberlam Dec 27 '14 at 16:59
  • @Jongware, it's possible to create, say, a 5- or 6-byte UTF-8-like encoded character manually, just not with Python's `\U` syntax. – Mark Tolonen Dec 27 '14 at 17:00
  • @MarkTolonen: got it -- so Python itself guards against that. So, by extension, you cannot use Python string functions to check for a value *outside* of it? Can an externally loaded text string *also* not contain invalid Unicode? (Which implies Python would clean input before converting it to a string object -- correct?) – Jongware Dec 27 '14 at 17:04
  • The original [UTF-8 design](http://en.wikipedia.org/wiki/UTF-8#Description) allows for 5- and 6-byte UTF-8 encodings so it is possible for someone to generate a file with illegal Unicode characters encoded that way. – Mark Tolonen Dec 27 '14 at 17:05
  • Apologies for asking the above clarification, as it veers off the topic of the question: "why can't OP use the *correct* way for Unicode characters?" – Jongware Dec 27 '14 at 17:07
  • If you decode a file with UTF-8 that violates the spec, Python will throw an error, so the answer to your question is "just open the file and decode it as UTF-8". Python will handle it if the characters are invalid. – Mark Tolonen Dec 27 '14 at 17:10
  • Cool, that's all I wanted to know. Feel free to answer the question with that and I'll mark it as accepted – Dan Oberlam Dec 27 '14 at 17:11
  • There are no Unicode characters and no Unicode code points beyond U+10FFFF, according to the definitions of the Unicode standard. You should rewrite the question. – Jukka K. Korpela Dec 27 '14 at 17:49
  • @JukkaK.Korpela better? – Dan Oberlam Dec 27 '14 at 17:52
  • As I wrote, there are no Unicode code points beyond U+10FFFF either. You should describe the data you are actually dealing with, instead of calling it Unicode code points or characters when it cannot be that. – Jukka K. Korpela Dec 27 '14 at 17:56

1 Answer

If you decode a file as UTF-8 and its bytes violate the spec, Python will raise a `UnicodeDecodeError`, so the answer to your question is "just open the file and decode it as UTF-8". Python will reject the invalid sequences for you.

Example:

>>> b'\xf4\x8f\xbf\xbf'.decode('utf8')
u'\U0010ffff'

# UTF-8 equivalent to \U00110000...
>>> len(b'\xf4\x90\x80\x80'.decode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\dev\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid continuation byte
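
For a preprocessing routine, the same check can be wrapped in a couple of helpers. This is only a minimal sketch, assuming Python 3 and input that is already in memory as `bytes`; the function names `decode_utf8_strict` and `decode_utf8_dropping_invalid` are made up for illustration and are not part of any library.

```python
def decode_utf8_strict(data):
    """Return the decoded text, or None if data is not well-formed UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_utf8_dropping_invalid(data):
    """Decode data, silently dropping any bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="ignore")


valid = b"\xf4\x8f\xbf\xbf"     # encodes U+10FFFF, the highest code point
too_high = b"\xf4\x90\x80\x80"  # would encode 0x110000, which UTF-8 forbids

print(decode_utf8_strict(valid) == "\U0010ffff")               # True
print(decode_utf8_strict(too_high))                            # None
print(decode_utf8_dropping_invalid(b"ab" + too_high + b"cd"))  # abcd
```

Whether to reject the whole input (strict) or strip the offending bytes (`errors="ignore"`) depends on what the preprocessing step is supposed to guarantee; stripping hides the fact that the file was malformed, so rejecting is usually the safer default.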
Mark Tolonen
  • This solves the OP's X-Y problem, so good call on that ... but it left me wondering how to construct OP's regex. – Jongware Dec 27 '14 at 17:31
  • @Jongware, maybe something like in [this answer](http://stackoverflow.com/a/9848242/235698). It finds valid UTF-8 sequences. – Mark Tolonen Dec 27 '14 at 19:32