
Is there a built-in way to "convert" a bytestring to a unicode string? I don't want to decode it; I want the string I see on print, without the "b".

e.g. Input:

b'\xb5\xb5\xb5\xb5\r\n1'

Output:

'\xb5\xb5\xb5\xb5\r\n1'  

I've tried iterating over the byte string, but that gives me a list of integers:

my_bytestring = b'%PDF-1.4\n%\x93\x8c\x8b\x9e'

my_string = ""
my_list = []
for char in my_bytestring:
    my_list.append(char)
    my_string += str(char)
print(my_list)   # -> list of ints
print(my_string) # -> string of converted ints

I get:

[37, 80, 68, 70, 45, 49, 46, 52, 10, 37, 147, 140, 139, 158]

I want:

['%', 'P', 'D', 'F', '-', '1', '.', '4', '\\', 'n', '%', '\\', 'x', '9', '3', '\\', 'x', '8', 'c', '\\', 'x', '8', 'b', '\\', 'x', '9', 'e']
Yobmod
  • But they're both technically the same string... c.f: https://stackoverflow.com/questions/7262828/python-how-to-convert-string-literal-to-raw-string-literal – TerryA Apr 26 '18 at 10:10
  • None of the answers there do what I want though. They all decode or start from a unicode string. I amended the question to show what I get vs. what I need. – Yobmod Apr 26 '18 at 10:17
  • Where is the bytestring coming from? I.E: Why can't you just do `r'...'` and not `b'...'` – TerryA Apr 26 '18 at 10:23
  • Do you want the result to contain literal `\ `, `x`, `b`, etc.? – tobias_k Apr 26 '18 at 10:24
  • The bytestrings are files with unknown encodings. I need them as unicode strings to use as Windows filenames. Python can obviously do it, or it wouldn't be possible to represent the bytestring on my screen (which IS unicode). – Yobmod Apr 26 '18 at 10:28
  • I could've sworn your expected output was a raw string just a little while ago... – Aran-Fey Apr 26 '18 at 10:30
  • @tobias_k With them. I can always strip them if they are there, once I have a list/string. – Yobmod Apr 26 '18 at 10:31
  • You're asking two different questions here. The first string is treated like a normal string (i.e. `b'\xb5'` becomes `'\xb5'`), while the 2nd string is treated like a raw string (i.e. `b'\xb5'` becomes `r'\xb5'`). – Aran-Fey Apr 26 '18 at 10:41
  • I should have just used 2 examples that couldn't 'unicode-escape'. I wasn't expecting so many responses saying to use decode after I said it didn't work, lol. (I won't change it now that it has answers, though.) – Yobmod Apr 26 '18 at 10:50

3 Answers


Use the built-in chr(i) function, which maps each byte value to the character with the same code point:

>>> b = b"\xb5\xb5\xb5\xb5\r\n1"
>>> s = "".join([chr(i) for i in b])
>>> s
'µµµµ\r\n1'
>>> len(b), len(s)
(7, 7)

As @hop mentioned, it would be better to use this method:

>>> s0 = b.decode(encoding="unicode_escape")
>>> s0
'µµµµ\r\n1'
>>> len(s0)
7

However, looking at your 2nd example, it seems you need the built-in repr(object), which returns the escaped text you see when the bytestring is printed:

>>> my_bytestring = b'%PDF-1.4\n%\x93\x8c\x8b\x9e'
>>> # repr() gives "b'...'": drop the leading b' and the closing quote, then split into characters
>>> l = list(repr(my_bytestring)[2:-1])
>>> l
['%', 'P', 'D', 'F', '-', '1', '.', '4', '\\', 'n', '%', '\\', 'x', '9', '3', '\\', 'x', '8', 'c', '\\', 'x', '8', 'b', '\\', 'x', '9', 'e']
>>> len(my_bytestring), len(l)
(14, 27)
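
For reuse, the two conversions above could be wrapped up roughly as follows. This is only a sketch: the helper names `bytes_to_codepoints` and `bytes_to_escaped` are made up for illustration and are not part of any library.

```python
def bytes_to_codepoints(data: bytes) -> str:
    """Map each byte value 0-255 to the character with the same code point."""
    return "".join(chr(i) for i in data)


def bytes_to_escaped(data: bytes) -> str:
    """Return the text shown by repr(data), minus the leading b and the quotes."""
    # repr(data) looks like b'...' or b"...": drop the 2-character prefix
    # and the final quote, keeping the escape sequences as literal text.
    return repr(data)[2:-1]


first = b"\xb5\xb5\xb5\xb5\r\n1"
second = b"%PDF-1.4\n%\x93\x8c\x8b\x9e"

print(len(first), len(bytes_to_codepoints(first)))  # one character per byte
print(list(bytes_to_escaped(second)))               # one item per printed character
```

Which of the two you want depends on whether the backslashes should survive as literal characters in the result.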
CristiFati

Technically you cannot get from bytes to strings without decoding, but there is a codec that does what you want:

>>> b = b'\xb5\xb5\xb5\xb5\r\n1'
>>> s = b.decode('unicode_escape')
>>> s
'µµµµ\r\n1'
>>> print(s)
µµµµ
1

There is also raw_unicode_escape. You can read about the differences between the two codecs in the documentation of the codecs module.

I very much doubt that there is a use case for having binary data in a unicode string.
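
One caveat worth checking before relying on the escape codecs: they interpret backslash sequences that are already present in the data, so the result is not always one character per byte. A small sketch of the difference, using a made-up sample value (`data` is not from the question):

```python
# Hypothetical sample: the bytes C : \ n e w followed by the single byte 0xB5.
data = b"C:\\new\xb5"

as_escape = data.decode("unicode_escape")      # backslash + n becomes a newline
as_raw = data.decode("raw_unicode_escape")     # only \uXXXX / \UXXXXXXXX are interpreted
as_latin1 = data.decode("latin-1")             # every byte maps to one character

print(len(data), len(as_escape), len(as_raw), len(as_latin1))

# raw_unicode_escape still rewrites \u sequences found in the data.
print(b"\\u00b5".decode("raw_unicode_escape") == "\xb5")
```

If the input can contain backslashes, as arbitrary file contents can, the byte-for-byte mapping of `latin-1` is the more predictable of the three.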

  • Doesn't work for the second string. Gives: UnicodeEncodeError: 'charmap' codec can't encode characters in position 11-14: character maps to <undefined> – Yobmod Apr 26 '18 at 10:37
  • @Yobmod I can't reproduce that. `b'%PDF-1.4\n%\x93\x8c\x8b\x9e'.decode('unicode_escape')` returns `'%PDF-1.4\n%\x93\x8c\x8b\x9e'`. – Aran-Fey Apr 26 '18 at 10:48
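
A possible explanation for the mismatch in these two comments: the reported exception is a UnicodeEncodeError, which is raised when a str is encoded (for example when printing to a console with a legacy code page), not when bytes are decoded. The sketch below assumes a `cp1252` console purely for illustration; the code page actually in use was not stated.

```python
s = b'%PDF-1.4\n%\x93\x8c\x8b\x9e'.decode('unicode_escape')  # decoding succeeds

# Encoding for output is where a charmap codec can fail.
# 'cp1252' is an assumed console encoding, not taken from the thread.
try:
    s.encode('cp1252')
    encode_failed = False
    reason = None
except UnicodeEncodeError as exc:
    encode_failed = True
    reason = exc.reason

print(encode_failed, reason)

# One workaround: replace unencodable characters with backslash escapes.
safe = s.encode('cp1252', errors='backslashreplace').decode('cp1252')
print(safe)
```

Printing `repr(s)` or `ascii(s)` instead of `s` is another way to sidestep the console encoding.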

The PDF payload obviously isn't UTF-8, nor text in any other encoding. It is raw data, not any form of text.

BUT there is an encoding that maintains every byte value from 0 to 255, mapping each one to the character with the same code point:

data = data.decode("latin1")

This changes the data type from bytes to str.

It isn't a brilliant solution, because it consumes CPU time and memory by creating a new object, but it is the only one.

It is a nuisance that Python has no instruction to just change the data type from bytes to str without processing.
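
A quick way to convince yourself that this decode is lossless is to push all 256 byte values through it and back. This is a minimal check, not a benchmark; the variable names are arbitrary.

```python
all_bytes = bytes(range(256))

# "latin-1" and "latin1" are aliases for the same codec (ISO-8859-1).
text = all_bytes.decode("latin-1")

# Every byte becomes the character with the same code point...
print([ord(c) for c in text] == list(range(256)))

# ...and encoding with the same codec restores the original bytes.
print(text.encode("latin-1") == all_bytes)

# It matches the chr()-based join, without the Python-level loop.
print(text == "".join(chr(i) for i in all_bytes))
```

Because the mapping is one-to-one, the original bytes can always be recovered later with `.encode("latin-1")`.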

Massimo