
I need to decode a byte string into unicode and insert it into a unicode string (Python 2.7). When I later encode that unicode string back into bytes, the byte array must be equal to the original bytes. My question is which encoding I should use to achieve this.

Example:

#every possible byte
byteString = b"".join([chr(ii) for ii in xrange(256)])
unicodeString = u"{0}".format(byteString.decode("ascii"))
backToBytes = unicodeString.encode("ascii")
assert byteString==backToBytes

This fails with the infamous:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 128: ordinal not in range(128)

What encoding should I use here (instead of 'ascii') to preserve my byte values?

I am using "ascii" in this (currently broken) example, because it is my default encoding:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
stochastic

1 Answer


It turns out that the 'latin1' (aka 'iso-8859-1') encoding will preserve every byte literally. This link mentions this fact, although other sources led me to believe this was false. I can confirm that running this code on python 2.7 works, demonstrating that 'iso-8859-1' does indeed preserve every possible byte:

#every possible byte
byteString = b"".join([chr(ii) for ii in xrange(256)])
unicodeString = u"{0}".format(byteString.decode("iso-8859-1"))
backToBytes = unicodeString.encode("iso-8859-1")
assert byteString==backToBytes
stochastic
  • The gap you see in that “other source” is control characters: U+0080 to U+009F are control characters. The same goes for bytes 0 to 31, and byte 127. – roeland May 02 '16 at 22:05
  • Python 2/3 version: `all_bytes = bytearray(range(0x100));` `all_bytes.decode('latin1').encode('latin1') == all_bytes` (note: [this encoding may be interpreted differently in different contexts, e.g., a Python executable and a web browser may use different definitions](http://stackoverflow.com/a/19110555/4279)). iso-8859-1 doesn't support all Unicode characters, and therefore you can't use it to encode an arbitrary Unicode string, e.g., `u"\N{SNOWMAN}".encode('latin1')` fails. To pass arbitrary bytes as Unicode, you could use something like the `'surrogateescape'` handler from Python 3 (see the sketches below). – jfs May 04 '16 at 18:06
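
Regarding roeland's point about control characters: a quick check with the standard unicodedata module (a small sketch, run under Python 2.7; the sampled code points are my own choice) confirms that those code points are classified as control characters ('Cc'):

# sketch: the "gaps" in latin-1 tables are control characters (category 'Cc')
import unicodedata
for code_point in [0x00, 0x1f, 0x7f, 0x80, 0x9f]:
    assert unicodedata.category(unichr(code_point)) == 'Cc'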
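
And regarding the `'surrogateescape'` handler jfs mentions: it is a Python 3 feature (not available to Python 2.7's `str.decode`), but for completeness, a minimal Python 3 sketch of that approach could look like this:

# Python 3 sketch: 'surrogateescape' smuggles undecodable bytes through
# a str and restores them when encoding with the same handler.
all_bytes = bytes(range(0x100))
text = all_bytes.decode("ascii", errors="surrogateescape")
assert text.encode("ascii", errors="surrogateescape") == all_bytes

Unlike the latin-1 trick, this works with any codec, but the resulting string contains lone surrogates (U+DC80 to U+DCFF) that will fail to encode unless the same error handler is used again.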