Converting Byte to String and Back Properly in Python3?

Question

Given random bytes (i.e. not only numbers/characters!), I need to convert them to a string and then back to the initial bytes without losing information. This seems like a basic task, but I ran into the following problems:

Assuming:

rnd_bytes = b'w\x12\x96\xb8'
len(rnd_bytes)

prints: 4

Now, converting it to a string. Note: I need to set backslashreplace as it otherwise returns a 'UnicodeDecodeError' or would lose information setting it to another flag value.

my_str = rnd_bytes.decode('utf-8' , 'backslashreplace')

Now, I have the string. I want to convert it back to exactly the original byte (size 4!):

According to the Python resources and this answer, there are different possibilities:

conv_bytes = bytes(my_str, 'utf-8')
conv_bytes = my_str.encode('utf-8')

But len(conv_bytes) returns 10.

I tried to analyse the outcome:

>>> repr(rnd_bytes)
"b'w\\x12\\x96\\xb8'"
>>> repr(my_str)
"'w\\x12\\\\x96\\\\xb8'"
>>> repr(conv_bytes)
"b'w\\x12\\\\x96\\\\xb8'"

It would make sense to replace '\\\\'. my_str.replace('\\\\','\\') doesn't change anything. Probably, because four backslashes represent only two. So, my_str.replace('\\','\') would find the '\\\\', but leads to

SyntaxError: EOL while scanning string literal

due to the last argument '\'. This had been discussed here, where the following suggestion came up:

>>> my_str2=my_str.encode('utf_8').decode('unicode_escape')
>>> repr(my_str2)
"'w\\x12\\x96¸'"

This replaces the '\\\\' but seems to add / change some other characters:

>>> conv_bytes2 = my_str2.encode('utf8')
>>> len(conv_bytes2)
6
>>> repr(conv_bytes2)
"b'w\\x12\\xc2\\x96\\xc2\\xb8'"

There must be a proper way to convert (complex) bytes to a string and back. How can I achieve that?

What purpose are you trying to achieve by turning arbitrary bytes into text? — Ignacio Vazquez-Abrams, May 07 '18 at 10:06
I have a given function that needs a string as input and outputs the same string / sometimes edited string. But I receive the input string as byte and need to process it further as byte. This would go beyond the scope of the discussion here but it's a basic task so that there has to be a possibility. — black, May 07 '18 at 10:11
@legalalien It can be but it doesn't have to be. The string is only for the processing algorithm. Please feel free to write an answer if you know how to use hex to solve the conversion problem. — black, May 07 '18 at 10:53

Ozgur Bagci · Accepted Answer · 2020-11-15T10:28:47.527

16

Note: some of this code was found on the Internet.

You could try to convert it to hex format. Then it is easy to convert it back to byte format.

Sample code to convert bytes to string:

hex_str = rnd_bytes.hex()

Here is what 'hex_str' looks like:

'771296b8'

And code for converting it back to bytes:

new_rnd_bytes = bytes.fromhex(hex_str)

The result is:

b'w\x12\x96\xb8'
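As a minimal sketch of the full round trip, the two calls above can be wrapped in small helpers and checked against every possible byte value. The helper names `bytes_to_hex_str` and `hex_str_to_bytes` are made up for illustration; only `bytes.hex()` and `bytes.fromhex()` come from the standard library.

```python
def bytes_to_hex_str(data: bytes) -> str:
    """Return a plain-ASCII string with two hex digits per input byte."""
    return data.hex()


def hex_str_to_bytes(text: str) -> bytes:
    """Invert bytes_to_hex_str; raises ValueError on malformed input."""
    return bytes.fromhex(text)


rnd_bytes = b'w\x12\x96\xb8'
hex_str = bytes_to_hex_str(rnd_bytes)
restored = hex_str_to_bytes(hex_str)

print(hex_str)                # 771296b8
print(restored == rnd_bytes)  # True

# The round trip is lossless for all 256 byte values, not only this sample.
all_bytes = bytes(range(256))
print(hex_str_to_bytes(bytes_to_hex_str(all_bytes)) == all_bytes)  # True
```

The hex string is twice as long as the input, which is the price for getting a string that contains only ASCII characters.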

For processing you can use:

readable_str = ''.join(chr(int(hex_str[i:i+2], 16)) for i in range(0, len(hex_str), 2))

But never try to encode the readable string. Here is what the readable string looks like:

'w\x12\x96¸'

After processing the readable string, convert it back to hex format before converting it back to a bytes string, like this:

# '02x' keeps the leading zero for characters below 0x10
hex_str = ''.join(format(ord(c), '02x') for c in readable_str)
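One detail worth checking when converting the readable string back to hex: every character has to become exactly two hex digits, otherwise the digits of neighbouring bytes run together. The sketch below uses a sample that starts with the byte 0x05 to show the difference; the helper names `text_to_hex` and `hex_to_text` are hypothetical. It assumes the processing step never introduces a character above U+00FF, since such a character cannot map to a single byte.

```python
def hex_to_text(hex_str: str) -> str:
    """Turn each pair of hex digits into the character with that code point."""
    return ''.join(chr(int(hex_str[i:i + 2], 16))
                   for i in range(0, len(hex_str), 2))


def text_to_hex(readable_str: str) -> str:
    """Write each character (code point 0..255) as exactly two hex digits."""
    return ''.join(format(ord(ch), '02x') for ch in readable_str)


data = b'\x05w\x12\x96\xb8'
readable = hex_to_text(data.hex())

# Without zero padding, 0x05 becomes '5' and the string gets an odd length.
unpadded = ''.join(hex(ord(ch))[2:] for ch in readable)
padded = text_to_hex(readable)

print(unpadded)                       # 5771296b8
print(padded)                         # 05771296b8
print(bytes.fromhex(padded) == data)  # True
```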

edited Nov 15 '20 at 10:28

answered May 07 '18 at 12:29

Ozgur Bagci


Great solution! I was struggling with this for hours! Thanks! – squidg Sep 08 '21 at 12:14
It is funny, I struggled with the same thing, and came back to my answer to resolve it. – Ozgur Bagci May 11 '23 at 19:06

score 4 · Answer 2 · answered Aug 05 '22 at 04:26

Now, converting it to a string. Note: I need to set backslashreplace as it otherwise returns a 'UnicodeDecodeError' or would lose information setting it to another flag value.

The UTF-8 encoding cannot interpret every possible sequence of bytes as a string. Using backslashreplace gives you a string that preserves the information for bytes that couldn't be converted:

>>> rnd_bytes = b'w\x12\x96\xb8'
>>> rnd_bytes.decode('utf-8', 'backslashreplace')
'w\x12\\x96\\xb8'

but that representation is not very useful for converting back.
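A small sketch of why that representation cannot be reversed reliably: an invalid byte and the literal ASCII text of its escape sequence decode to the same string, so the string alone no longer says which one the input contained. The variable names are only for illustration.

```python
invalid = b'\x96'    # one byte that is not valid UTF-8 on its own
literal = b'\\x96'   # four ASCII bytes: backslash, 'x', '9', '6'

s1 = invalid.decode('utf-8', 'backslashreplace')
s2 = literal.decode('utf-8', 'backslashreplace')

print(len(invalid), len(literal))  # 1 4
print(s1 == s2)                    # True: two different inputs, one string
```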

Instead, use an encoding that does interpret every possible sequence of bytes as a string. The most straightforward of these is ISO-8859-1, which simply maps each byte one at a time to the first 256 Unicode code points respectively.

>>> rnd_bytes.decode('iso-8859-1')
'w\x12\x96¸'
>>> rnd_bytes.decode('iso-8859-1').encode('iso-8859-1') == rnd_bytes
True
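For the use case described in the question's comments (a function that takes a string, while the data arrives and leaves as bytes), this could be wrapped roughly as follows. It is a sketch under assumptions: `process_bytes_as_text` and the lambdas are hypothetical stand-ins for the real processing function, and the function must not introduce characters above U+00FF, otherwise the final `encode` raises a UnicodeEncodeError.

```python
from typing import Callable


def process_bytes_as_text(data: bytes, func: Callable[[str], str]) -> bytes:
    """Decode bytes one-to-one into code points 0..255, apply func, encode back."""
    text = data.decode('iso-8859-1')
    result = func(text)
    # Fails with UnicodeEncodeError if func added a character above U+00FF.
    return result.encode('iso-8859-1')


rnd_bytes = b'w\x12\x96\xb8'

# An identity function gives back exactly the original bytes.
unchanged = process_bytes_as_text(rnd_bytes, lambda s: s)
print(unchanged == rnd_bytes)  # True

# A function that edits the string changes only the affected bytes.
edited = process_bytes_as_text(rnd_bytes, lambda s: s.replace('w', 'W'))
print(edited)                  # b'W\x12\x96\xb8'

# The mapping is lossless for every possible byte value.
all_bytes = bytes(range(256))
print(process_bytes_as_text(all_bytes, lambda s: s) == all_bytes)  # True
```

Unlike the hex approach, the string here has the same length as the input, with one character per byte.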
