I've come across a few very troublesome strings while crawling the web. In particular, one page advertises itself as UTF-7, and though it's not quite valid UTF-7, that doesn't appear to be the issue. I'm not concerned with preserving the exact intent of the text; I just need to get it into UTF-8 for downstream consumption.
The oddity I'm faced with is that I can end up with a unicode string that cannot be UTF-8 encoded and then decoded back. I've distilled the string down as much as I can while still reproducing the error:
# Avoid shadowing the builtin name `bytes`
byte_values = [43, 105, 100, 41, 46, 101, 95, 39, 43, 105, 100, 43]
string = ''.join(chr(c) for c in byte_values)
# This particular string happens to be advertised as UTF-7, though it is
# a bit malformed. We'll ignore those errors when decoding it.
decoded = string.decode('utf-7', 'ignore')
# This decoded string, however, cannot be encoded to UTF-8 and back:
error = decoded.encode('utf-8').decode('utf-8')
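For reference, here is a Python 3 translation of the same round-trip (a sketch only; Python 3's UTF-7 decoder will not necessarily behave identically to any particular 2.x version):

```python
# Python 3 sketch of the same round-trip, using the same byte values.
data = bytes([43, 105, 100, 41, 46, 101, 95, 39, 43, 105, 100, 43])

# Decode the (malformed) UTF-7, ignoring errors, then round-trip
# through UTF-8. On Python 3 this round-trip completes without error.
decoded = data.decode('utf-7', 'ignore')
roundtrip = decoded.encode('utf-8').decode('utf-8')
assert roundtrip == decoded
```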
I've run this successfully on a number of systems: Python 2.7.1 and 2.6.7 on Mac OS X 10.5.7, and Python 2.7.2 and 2.6.8 on CentOS. Unfortunately, it fails on the machines we actually need it to work on, which run Python 2.7.3 on Ubuntu 12.04. On the failing system, I see:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf7 in position 4: invalid start byte
Here are some of the intermediate values that I see on the working vs. non-working systems:
# Working:
>>> repr(decoded)
'u".e_\'\\u89df"'
>>> repr(decoded.encode('utf-8'))
'".e_\'\\xe8\\xa7\\x9f"'
# Non-working:
>>> repr(decoded)
'u".e_\'\\U089d89df"'
>>> repr(decoded.encode('utf-8'))
'".e_\'\\xf7\\x98\\xa7\\x9f"'
The two differ as soon as the UTF-7 decode completes, though why is still a mystery to me. I imagine it's an issue with missing character tables or an auxiliary library, because nothing between 2.7.2 and 2.7.3 appears to explain this behavior. On the systems where it works correctly, printing the unicode character displays a Chinese character; on the system where it fails, it displays a placeholder.
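One thing that may be worth comparing across the working and failing machines (an assumption on my part, not a confirmed diagnosis) is whether each interpreter is a narrow or wide Unicode build, and whether the odd code point even falls inside the Unicode range:

```python
import sys

# On Python 2, sys.maxunicode is 0xFFFF on "narrow" builds and 0x10FFFF on
# "wide" builds; the two can behave differently for characters outside the
# Basic Multilingual Plane. (On Python 3 it is always 0x10FFFF.)
print(hex(sys.maxunicode))

# The code point seen on the failing system, \U089d89df (0x089D89DF), lies
# far beyond U+10FFFF, the highest valid Unicode code point, so no UTF-8
# codec should be able to round-trip it.
print(0x089D89DF > 0x10FFFF)  # True
```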
This brings me to my question: does this issue look familiar to anyone, or does anyone have an idea what supporting libraries I might be missing on the affected system?