I have this issue and I can't figure out how to solve it. I have this string:

data = '\xc4\xb7\x86\x17\xcd'

When I tried to encode it:

data.encode()

I get this result:

b'\xc3\x84\xc2\xb7\xc2\x86\x17\xc3\x8d'

I only want:

b'\xc4\xb7\x86\x17\xcd'

Does anyone know the reason and how to fix this? The string is already stored in a variable, so I can't add the literal b prefix in front of it.

wim
avan989
  • Note that *“without change in encoding”* is a misleading requirement. When converting a string into bytes or vice-versa, you *have* to take an encoding into account in order to perform the conversion. – poke Jan 22 '18 at 08:05

2 Answers

You cannot convert a string into bytes, or bytes into a string, without taking an encoding into account. The whole point of the bytes type is that it is an encoding-independent sequence of bytes, while str is a sequence of Unicode code points, which by design have no unique byte representation.

So when you want to convert one into the other, you must state explicitly which encoding to use for the conversion. When converting into bytes, you have to say how to represent each character as a byte sequence; and when converting from bytes, you have to say which method to use to map those bytes into characters.

If you don’t specify the encoding, then UTF-8 is the default, which is a sane default since UTF-8 is ubiquitous, but it's also just one of many valid encodings.

Take your original string, '\xc4\xb7\x86\x17\xcd', and look at which Unicode code points its characters represent. \xc4, for example, is LATIN CAPITAL LETTER A WITH DIAERESIS, i.e. Ä. That character happens to be encoded in UTF-8 as 0xC3 0x84, which explains why that’s what you get when you encode it into bytes. In UTF-16, for example, the same character is encoded as 0x00C4 instead.
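If you want to check this yourself, a short sketch along these lines inspects the first character and encodes it three ways. The `utf-16-be` codec is used here only to avoid the byte order mark that plain `utf-16` would prepend:

```python
import unicodedata

data = '\xc4\xb7\x86\x17\xcd'
first = data[0]

# Each element of a str is a Unicode code point, not a byte.
code_point = ord(first)          # 0xC4
name = unicodedata.name(first)   # official Unicode character name

# The same code point has a different byte representation per encoding.
utf8 = first.encode('utf-8')
utf16 = first.encode('utf-16-be')
latin1 = first.encode('latin-1')

print(hex(code_point), name, utf8, utf16, latin1)
```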


As for how to solve this properly so you get the desired output, there is no clear correct answer. The solution that Kasramvd mentioned is also somewhat imperfect. If you read about the raw_unicode_escape codec in the documentation:

raw_unicode_escape

Latin-1 encoding with \uXXXX and \UXXXXXXXX for other code points. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol.

So this is just a Latin-1 encoding which has a built-in fallback for characters outside of it. I would consider this fallback somewhat harmful for your purpose. For Unicode characters that cannot be represented as a \xXX sequence, this might be problematic:

>>> chr(256).encode('raw_unicode_escape')
b'\\u0100'

Code point 256 lies outside of Latin-1, so the raw_unicode_escape encoding instead returns the encoded bytes for the string '\\u0100'. That turns one character into 6 bytes which have little to do with the original character (since it’s an escape sequence).

So if you want to use Latin-1 here, I would suggest you use that codec explicitly, without the escape sequence fallback from raw_unicode_escape. This will simply raise an exception when you try to convert code points outside of the Latin-1 range:

>>> '\xc4\xb7\x86\x17\xcd'.encode('latin1')
b'\xc4\xb7\x86\x17\xcd'
>>> chr(256).encode('latin1')
Traceback (most recent call last):
  File "<pyshell#28>", line 1, in <module>
    chr(256).encode('latin1')
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0100' in position 0: ordinal not in range(256)
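If you would rather get an error message that names the offending character, you could wrap the call. This is only a sketch: `to_bytes_strict` is a hypothetical helper name, not something from the standard library.

```python
def to_bytes_strict(text: str) -> bytes:
    """Hypothetical helper: map code points 0-255 to single bytes, fail otherwise."""
    try:
        return text.encode('latin-1')
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that could not be encoded.
        bad = text[exc.start]
        raise ValueError(
            f'code point U+{ord(bad):04X} at index {exc.start} '
            'does not fit in a single byte'
        ) from exc


result = to_bytes_strict('\xc4\xb7\x86\x17\xcd')

try:
    to_bytes_strict('ok' + chr(256))
    message = None
except ValueError as err:
    message = str(err)
```

Failing loudly here is a deliberate choice: silently substituting an escape sequence would hide the fact that the input was not byte-like data after all.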

Of course, whether code points outside of the Latin-1 range can cause problems for you depends on where that string actually comes from. But if you can guarantee that the input will only contain valid Latin-1 characters, then chances are that you don't really need to be working with a string there in the first place. Since you are actually dealing with some kind of bytes, you should check whether you can simply retrieve those values as bytes to begin with. That way you won’t introduce two levels of encoding where you can corrupt data by misinterpreting the input.
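As an illustration of getting bytes directly, here is a sketch that assumes the data comes from a file. The question does not say where the string originates, so the file and its name `payload.bin` are made up for the example:

```python
import os
import tempfile

payload = b'\xc4\xb7\x86\x17\xcd'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'payload.bin')  # hypothetical file name
    with open(path, 'wb') as f:
        f.write(payload)

    # Binary mode returns bytes directly; no codec is involved.
    with open(path, 'rb') as f:
        as_bytes = f.read()

    # Text mode decodes on the way in (Latin-1 here), which gives the
    # str from the question and forces an encode step later on.
    with open(path, encoding='latin-1', newline='') as f:
        as_text = f.read()

# The detour through str only works if both sides agree on Latin-1.
round_trip = as_text.encode('latin-1')
```

The same idea applies to sockets, subprocess pipes and similar sources, most of which can hand you bytes if you ask for them.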

wim
poke
You can use 'raw_unicode_escape' as your encoding:

In [14]: bytes(data, 'raw_unicode_escape')
Out[14]: b'\xc4\xb7\x86\x17\xcd'

As mentioned in the comments, you can also pass the encoding directly to the encode method of your string:

In [15]: data.encode("raw_unicode_escape")
Out[15]: b'\xc4\xb7\x86\x17\xcd'
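For reference, a small sketch of how this codec behaves on the string from the question compared with plain `latin-1`. The two agree as long as every code point is below 256, and only differ beyond that:

```python
data = '\xc4\xb7\x86\x17\xcd'

via_bytes = bytes(data, 'raw_unicode_escape')
via_encode = data.encode('raw_unicode_escape')
via_latin1 = data.encode('latin-1')

# Beyond code point 255, raw_unicode_escape emits a textual \uXXXX escape,
# while latin-1 raises UnicodeEncodeError.
escaped = chr(256).encode('raw_unicode_escape')
try:
    chr(256).encode('latin-1')
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
```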
Mazdak
  • @Jean-FrançoisFabre That's even better in this case! – Mazdak Jan 21 '18 at 13:21
  • @avan989 don't "thank you". accept the answer instead. – Jean-François Fabre Jan 21 '18 at 13:21
  • those string <=> bytes conversions are really hell :) can you explain why performing a default encoding adds this trash? (honest question, I don't know the answer). If you can't, well, never mind. – Jean-François Fabre Jan 21 '18 at 13:24
  • @Jean-FrançoisFabre What exactly do you mean by *hell* and *trash*? There are many problems with such conversions, depending on the situation! ;)) – Mazdak Jan 21 '18 at 13:32
  • @Jean-FrançoisFabre The main problem IMHO is that bytes are an immutable sequence of integers and, as stated in the docs, only ASCII characters are permitted in bytes literals (regardless of the declared source code encoding). Any binary values over 127 must be entered into bytes literals using the appropriate escape sequence. Now, especially in Python 3.x, you need to convert between all these escaped literals and all the possible Unicode characters, and also convert the types between string and integer, etc. All those processes take a lot of checking and time. – Mazdak Jan 21 '18 at 13:38