
I need to decode a byte string into unicode and insert it into a unicode string (Python 2.7). When I later encode that unicode string back into bytes, the byte array must be equal to the original bytes. My question is which encoding I should use to achieve this.

Example:

#every possible byte
byteString = b"".join([chr(ii) for ii in xrange(256)])
unicodeString = u"{0}".format(byteString.decode("ascii"))
backToBytes = unicodeString.encode("ascii")
assert byteString==backToBytes

This fails with the infamous:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 128: ordinal not in range(128)

What encoding should I use here (instead of 'ascii') to preserve my byte values?

I am using "ascii" in this (currently broken) example, because it is my default encoding:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
stochastic

1 Answer


It turns out that the 'latin1' (aka 'iso-8859-1') encoding will preserve every byte literally. This link mentions this fact, although other sources led me to believe this was false. I can confirm that running this code on python 2.7 works, demonstrating that 'iso-8859-1' does indeed preserve every possible byte:

#every possible byte
byteString = b"".join([chr(ii) for ii in xrange(256)])
unicodeString = u"{0}".format(byteString.decode("iso-8859-1"))
backToBytes = unicodeString.encode("iso-8859-1")
assert byteString==backToBytes
stochastic
  • The gap you see in that “other source” is control characters: U+0080 to U+009F are control characters. The same goes for bytes 0 to 31, and byte 127. – roeland May 02 '16 at 22:05
  • Python 2/3 version: `all_bytes = bytearray(range(0x100));` `all_bytes.decode('latin1').encode('latin1') == all_bytes` (note: [this encoding may be interpreted differently in different contexts, e.g., a Python executable and a web browser may use different definitions](http://stackoverflow.com/a/19110555/4279)). iso-8859-1 doesn't support all Unicode characters, and therefore you can't use it to encode an arbitrary Unicode string, e.g., `u"\N{SNOWMAN}".encode('latin1')` fails. To pass arbitrary bytes as Unicode, you could use something like the `'surrogateescape'` handler from Python 3 (see the sketches below). – jfs May 04 '16 at 18:06
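
Regarding roeland's point about control characters: a quick check with the standard unicodedata module (a small sketch, run under Python 2.7; the sampled code points are my own choice) confirms that those code points are classified as control characters ('Cc'):

# sketch: the "gaps" in latin-1 tables are control characters (category 'Cc')
import unicodedata
for code_point in [0x00, 0x1f, 0x7f, 0x80, 0x9f]:
    assert unicodedata.category(unichr(code_point)) == 'Cc'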
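
And regarding the `'surrogateescape'` handler jfs mentions: it is a Python 3 feature (not available to Python 2.7's `str.decode`), but for completeness, a minimal Python 3 sketch of that approach could look like this:

# Python 3 sketch: 'surrogateescape' smuggles undecodable bytes through
# a str and restores them when encoding with the same handler.
all_bytes = bytes(range(0x100))
text = all_bytes.decode("ascii", errors="surrogateescape")
assert text.encode("ascii", errors="surrogateescape") == all_bytes

Unlike the latin-1 trick, this works with any codec, but the resulting string contains lone surrogates (U+DC80 to U+DCFF) that will fail to encode unless the same error handler is used again.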