I've come across a few very troublesome strings while crawling the web. In particular, one page advertises itself as UTF-7, and though it's not quite valid UTF-7, that doesn't appear to be the issue. I'm not concerned with preserving the exact intent of the text; I just need to get it into UTF-8 for downstream consumption.
The oddity I'm faced with is that I can end up with a unicode string that cannot be UTF-8 encoded and then decoded back. I've distilled the string down as much as I can while still reproducing the error:
# Avoid shadowing the builtin name `bytes`
byte_values = [43, 105, 100, 41, 46, 101, 95, 39, 43, 105, 100, 43]
string = ''.join(chr(c) for c in byte_values)
# This particular string happens to be advertised as UTF-7, though it is
# a bit malformed. We'll ignore those errors when decoding it.
decoded = string.decode('utf-7', 'ignore')
# This decoded string, however, cannot be encoded to UTF-8 and back:
error = decoded.encode('utf-8').decode('utf-8')
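For reference, here is a Python 3 translation of the same round-trip (a sketch only; Python 3's UTF-7 decoder will not necessarily behave identically to any particular 2.x version):

```python
# Python 3 sketch of the same round-trip, using the same byte values.
data = bytes([43, 105, 100, 41, 46, 101, 95, 39, 43, 105, 100, 43])

# Decode the (malformed) UTF-7, ignoring errors, then round-trip
# through UTF-8. On Python 3 this round-trip completes without error.
decoded = data.decode('utf-7', 'ignore')
roundtrip = decoded.encode('utf-8').decode('utf-8')
assert roundtrip == decoded
```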
I've run this successfully on a number of systems: Python 2.7.1 and 2.6.7 on Mac OS X 10.5.7, and Python 2.7.2 and 2.6.8 on CentOS. Unfortunately, it fails on the machines we actually need it to work on, which run Python 2.7.3 on Ubuntu 12.04. On the failing system, I see:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf7 in position 4: invalid start byte
Here are some of the intermediate values that I see on the working vs. non-working systems:
# Working:
>>> repr(decoded)
'u".e_\'\\u89df"'
>>> repr(decoded.encode('utf-8'))
'".e_\'\\xe8\\xa7\\x9f"'
# Non-working:
>>> repr(decoded)
'u".e_\'\\U089d89df"'
>>> repr(decoded.encode('utf-8'))
'".e_\'\\xf7\\x98\\xa7\\x9f"'
The two differ as soon as the UTF-7 decode completes, though why is still a mystery to me. I imagine it's an issue with missing character tables or an auxiliary library, because nothing between 2.7.2 and 2.7.3 appears to explain this behavior. On the systems where it works correctly, printing the unicode character displays a Chinese character; on the system where it fails, it displays a placeholder.
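One thing that may be worth comparing across the working and failing machines (an assumption on my part, not a confirmed diagnosis) is whether each interpreter is a narrow or wide Unicode build, and whether the odd code point even falls inside the Unicode range:

```python
import sys

# On Python 2, sys.maxunicode is 0xFFFF on "narrow" builds and 0x10FFFF on
# "wide" builds; the two can behave differently for characters outside the
# Basic Multilingual Plane. (On Python 3 it is always 0x10FFFF.)
print(hex(sys.maxunicode))

# The code point seen on the failing system, \U089d89df (0x089D89DF), lies
# far beyond U+10FFFF, the highest valid Unicode code point, so no UTF-8
# codec should be able to round-trip it.
print(0x089D89DF > 0x10FFFF)  # True
```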
This brings me to my question: does this issue look familiar to anyone, or does anyone have an idea what supporting libraries I might be missing on the affected system?