4

I have the following str:
"\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"

This comes from a filename: Расшифровка_RootKit.com_63k.txt

My problem is that I cannot convert the first str back into the second one. I have tried a few things, using encode()/decode(), bytes(), etc., but I did not manage it.

One thing I noticed was b'' and bytes() have different outputs:

path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"
bpath = bytes(path, "UTF-8")
print(bpath.decode("UTF-8"))
print(b"\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt".decode('utf8'))

Results:

РаÑÑиÑ
         Ñовка_RootKit.com_63k.txt
Расшифровка_RootKit.com_63k.txt

So I wonder what the difference between b'' and bytes() is; maybe it will help me solve my problem!

Chocorean

4 Answers

3

b'' is a prefix that causes the following literal to be interpreted as a bytes object. The bytes() function takes a string and an encoding and returns a bytes object.

print(b"\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt".decode('utf8'))

This works, because you are decoding a bytes object.

path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"
bpath = bytes(path, "UTF-8")
print(bpath.decode("UTF-8"))

This does not work as intended, because path is a str: each \xNN escape in it is a character (a code point), not a byte. bytes(path, "UTF-8") encodes those characters as UTF-8, and decoding the result simply gives back the same str.
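
To see the difference concretely, one can compare the bytes each expression produces. The following is only a small sketch on the first two escapes of the filename; the names text, raw and from_str are illustrative and not from the question.

```python
# The first two escapes of the filename, once as str and once as bytes.
text = "\xd0\xa0"    # two characters: U+00D0 and U+00A0
raw = b"\xd0\xa0"    # two bytes: 0xD0 0xA0

# Encoding the str as UTF-8 turns each of these characters into two bytes.
from_str = bytes(text, "UTF-8")
print(from_str)      # b'\xc3\x90\xc2\xa0'
print(raw)           # b'\xd0\xa0'

# Only the bytes literal holds the UTF-8 encoding of the Cyrillic letter.
print(raw.decode("UTF-8") == "\u0420")    # True (U+0420 is the Cyrillic letter Er)

# Decoding from_str just undoes the encoding and returns the original str.
print(from_str.decode("UTF-8") == text)   # True
```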

3ch0
  • Right, I now understand. Thank you – Chocorean Sep 04 '19 at 09:30
  • This doesn't explain anything. It would not be strange even if those two results are the same. The Python string and b'' interpret differently, that's the reason the results are different. Please see @awesoon's answer. – starriet Jul 21 '22 at 15:29
3

You may want to use the latin1 solution instead; see that answer first. This answer works if you accidentally copied the contents of a bytes object and pasted them as a string.

If you want to convert the string back to bytes, use the following code:

In [22]: path = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"

In [23]: bytes(map(ord, path)).decode('utf-8')
Out[23]: 'Расшифровка_RootKit.com_63k.txt'

The explanation is quite simple. Let's use the first character of the string:

In [40]: '\xd0'
Out[40]: 'Ð'

In [41]: b'\xd0'
Out[41]: b'\xd0'

As you can see, a string converts \xd0 to the Unicode character with code point 0xd0, while bytes just interprets it as a single byte.

UTF-8 uses the following mask for all characters between U+0080 and U+07FF: 110xxxxx for the first byte and 10xxxxxx for the second byte. This is exactly what you get when directly converting that string to bytes:

In [43]: [bin(x) for x in '\xd0'.encode('utf-8')]
Out[43]: ['0b11000011', '0b10010000']

And the actual symbol code is 00011 + 010000 (concatenation, not addition), which is 0xd0:

In [44]: hex(int('00011010000', 2))
Out[44]: '0xd0'
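
The bit layout above can also be checked with explicit bit operations. This is a rough sketch that only handles the two-byte range; encode_two_byte and decode_two_byte are hypothetical helper names, not part of Python.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte UTF-8 range is handled here")
    first = 0b11000000 | (code_point >> 6)          # 110xxxxx: upper 5 bits
    second = 0b10000000 | (code_point & 0b111111)   # 10xxxxxx: lower 6 bits
    return bytes([first, second])


def decode_two_byte(data: bytes) -> int:
    """Recover the code point from a two-byte UTF-8 sequence."""
    first, second = data
    return ((first & 0b11111) << 6) | (second & 0b111111)


# Matches what str.encode produces for the character U+00D0.
print(encode_two_byte(0xD0) == "\xd0".encode("utf-8"))   # True

# Concatenating the payload bits gives 0xd0 again.
print(hex(decode_two_byte(b"\xc3\x90")))                 # 0xd0

# The bytes 0xD0 0xA0 from the filename carry the code point U+0420.
print(hex(decode_two_byte(b"\xd0\xa0")))                 # 0x420
```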

To get this number from a character we can use ord:

In [45]: hex(ord('\xd0'))
Out[45]: '0xd0'

Then just apply ord to the whole string and convert the result back to bytes:

In [46]: bytes(map(ord, path)).decode('utf-8')
Out[46]: 'Расшифровка_RootKit.com_63k.txt'

Note that if a character in your string does not fit in a byte for some reason, the code above will raise an error:

In [47]: bytes([ord(chr(256))])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-49-5555e18dbece> in <module>
----> 1 bytes([ord(chr(256))])

ValueError: bytes must be in range(0, 256)
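
If the input may or may not be damaged, the conversion can be guarded. A minimal sketch, assuming it is acceptable to return the string unchanged when it cannot be reinterpreted; undo_mojibake is a hypothetical helper name.

```python
def undo_mojibake(text: str) -> str:
    """Reinterpret each character as one byte and decode the result as UTF-8.

    Returns the input unchanged when a character does not fit in a byte
    or when the resulting bytes are not valid UTF-8.
    """
    try:
        return bytes(map(ord, text)).decode("utf-8")
    except ValueError:
        # UnicodeDecodeError is a subclass of ValueError, so both cases land here.
        return text


print(undo_mojibake("\xd0\xa0") == "\u0420")   # True: repaired to the Cyrillic letter
print(undo_mojibake("\u0420") == "\u0420")     # True: ord() is 0x420, left unchanged
print(undo_mojibake("caf\xe9") == "caf\xe9")   # True: not valid UTF-8, left unchanged
```

Plain ASCII strings pass through unchanged as well, since ASCII bytes decode to themselves in UTF-8.
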
awesoon
2

To convert your string, just encode it to bytes using 'latin1', which maps each character from U+0000 to U+00FF to the single byte with the same value, and then decode using 'utf8':

s = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"

s.encode('latin1').decode('utf8')

# 'Расшифровка_RootKit.com_63k.txt'
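
As a quick check of the one-to-one mapping, the sketch below compares the latin1 encoding with taking ord() of each character; the names via_latin1, via_ord and all_bytes are illustrative only.

```python
s = "\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt"

# Encoding as latin1 yields one byte per character, equal to its code point.
via_latin1 = s.encode("latin1")
via_ord = bytes(map(ord, s))
print(via_latin1 == via_ord)   # True

# Every byte value 0..255 survives a latin1 round trip unchanged.
all_bytes = bytes(range(256))
print(all_bytes.decode("latin1").encode("latin1") == all_bytes)   # True

# Characters above U+00FF cannot be encoded as latin1.
try:
    chr(256).encode("latin1")
except UnicodeEncodeError as exc:
    print(type(exc).__name__)   # UnicodeEncodeError
```
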
Thierry Lathuille
1

The path variable is a str (not bytes). When you call bytes() on it, you are encoding it to bytes, which returns b'\xc3\x90\xc2\xa0\xc3\x90\xc2\xb0\xc3\x91\xc2\x81\xc3\x91\xc2\x88\xc3\x90\xc2\xb8\xc3\x91\xc2\x84\xc3\x91\xc2\x80\xc3\x90\xc2\xbe\xc3\x90\xc2\xb2\xc3\x90\xc2\xba\xc3\x90\xc2\xb0_RootKit.com_63k.txt'

But when you write b"\xd0\xa0\xd0\xb0\xd1\x81\xd1\x88\xd0\xb8\xd1\x84\xd1\x80\xd0\xbe\xd0\xb2\xd0\xba\xd0\xb0_RootKit.com_63k.txt", you are referring to the UTF-8 bytes of Расшифровка_RootKit.com_63k.txt

Adirmola
  • Right. Is there a way to turn my string into bytes without decoding it ? (I know this has not a lot of sense-) – Chocorean Sep 04 '19 at 09:24
  • As you say " this has not a lot of sense". I'm still struggling to understand what you are trying to do. What is your input and what is your desired output? – Adirmola Sep 04 '19 at 09:28
  • The file I'm talking about is uploaded to a flask api, but its name turns into my variable `path`, and flask fails saving the file, so I'm trying to rename it. – Chocorean Sep 04 '19 at 09:30