Python 3: Represent bytestring as a string (without decoding)

Question

Is there a built-in way to "convert" a bytestring to a unicode string? I don't want to decode it; I want the string I see on print, without the "b".

e.g. Input:

b'\xb5\xb5\xb5\xb5\r\n1'

output:

'\xb5\xb5\xb5\xb5\r\n1'

I've tried iterating over the byte string, but that gives me a list of integers:

my_bytestring = b'%PDF-1.4\n%\x93\x8c\x8b\x9e'

my_string = ""
my_list = []
for char in my_bytestring:
    my_list.append(char)
    my_string += str(char)
print(my_list)   # -> list of ints
print(my_string) # -> string of converted ints

I get:

[37, 80, 68, 70, 45, 49, 46, 52, 10, 37, 147, 140, 139, 158]

I want:

['%', 'P', 'D', 'F', '-', '1', '.', '4', '\\', 'n', '%', '\\', 'x', '9', '3', '\\', 'x', '8', 'c', '\\', 'x', '8', 'b', '\\', 'x', '9', 'e']

But they're both technically the same string... c.f: https://stackoverflow.com/questions/7262828/python-how-to-convert-string-literal-to-raw-string-literal — TerryA, Apr 26 '18 at 10:10
None of the answers there do what I want though. They all decode or start from a unicode string. I amended the question to show what I get vs what I need. — Yobmod, Apr 26 '18 at 10:17
Where is the bytestring coming from? I.E: Why can't you just do `r'...'` and not `b'...'` — TerryA, Apr 26 '18 at 10:23
Do you want the result to contain literal `\ `, `x`, `b`, etc.? — tobias_k, Apr 26 '18 at 10:24
The bytestrings are files with unknown encodings. I need them as unicode strings to use as windows filenames. Python can obviously do it, or it wouldn't be possible to represent the bytestring on my screen (which IS unicode). — Yobmod, Apr 26 '18 at 10:28
I could've sworn your expected output was a raw string just a little while ago... — Aran-Fey, Apr 26 '18 at 10:30
@tobias_k. With. I can always strip them if they are there, once I have a list/string — Yobmod, Apr 26 '18 at 10:31
You're asking two different questions here. The first string is treated like a normal string (i.e. `b'\xb5'` becomes `'\xb5'`), while the 2nd string is treated like a raw string (i.e. `b'\xb5'` becomes `r'\xb5'`). — Aran-Fey, Apr 26 '18 at 10:41
I should have just used 2 examples that couldn't 'unicode-escape'. I wasn't expecting so many responses saying to use decode after I said it didn't work, lol. (I wont change it now it has answers tho) — Yobmod, Apr 26 '18 at 10:50

CristiFati · Accepted Answer · 2018-04-26T11:10:23.220

2

Use the [Python]: chr(i) function:

>>> b = b"\xb5\xb5\xb5\xb5\r\n1"
>>> s = "".join([chr(i) for i in b])
>>> s
'µµµµ\r\n1'
>>> len(b), len(s)
(7, 7)
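
The `chr` join above maps each byte value (0 to 255) to the code point with the same number, which is also what the `latin-1` codec does. A minimal sketch of that equivalence follows; the helper name `bytes_to_str_via_chr` is made up for illustration and is not part of the answer:

```python
def bytes_to_str_via_chr(data: bytes) -> str:
    """Map each byte to the Unicode code point with the same value."""
    # Iterating over bytes yields ints in range(256); chr() turns each
    # int into a one-character str.
    return "".join(map(chr, data))


all_bytes = bytes(range(256))
via_chr = bytes_to_str_via_chr(all_bytes)

# Compare against decoding with latin-1 (ISO-8859-1), one character per byte.
print(via_chr == all_bytes.decode("latin-1"))
print(len(via_chr) == len(all_bytes))
```

If the comparison holds, as it should, `data.decode("latin-1")` is a shorter spelling of the same conversion.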

As @hop mentioned, it would be better to use this method:

>>> s0 = b.decode(encoding="unicode_escape")
>>> s0
'µµµµ\r\n1'
>>> len(s0)
7
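
The two methods agree on this input, but they are not interchangeable: `unicode_escape` also interprets backslash sequences that are literally present in the data, while the `chr` join leaves them alone. A small sketch with made-up sample data:

```python
# Sample data (hypothetical): 12 bytes that include two literal backslash bytes.
data = b"C:\\new\\table"

via_chr = "".join(chr(i) for i in data)
via_escape = data.decode("unicode_escape")

print(repr(via_chr))     # backslashes are kept as ordinary characters
print(repr(via_escape))  # backslash-n and backslash-t become newline and tab
print(len(data), len(via_chr), len(via_escape))
```

For arbitrary binary data, where a backslash byte is just a byte, the `chr` join or `latin-1` may therefore be the safer choice.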

However, looking at your 2nd example, it seems you need [Python]: repr(object):

>>> my_bytestring = b'%PDF-1.4\n%\x93\x8c\x8b\x9e'
>>> l = [i for i in repr(my_bytestring)][2:-1]
>>> l
['%', 'P', 'D', 'F', '-', '1', '.', '4', '\\', 'n', '%', '\\', 'x', '9', '3', '\\', 'x', '8', 'c', '\\', 'x', '8', 'b', '\\', 'x', '9', 'e']
>>> len(my_bytestring), len(l)
(14, 27)
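
Slicing the `repr` relies on the `b'...'` wrapper always being two characters plus a closing quote. A possible alternative that avoids the slicing is to go through `latin-1` and then re-escape with `unicode_escape`. This is a sketch, not the answer's method, and the helper name `escaped_text` is hypothetical:

```python
def escaped_text(data: bytes) -> str:
    """Return an escaped, ASCII-only form of data without a b'...' wrapper."""
    # latin-1 gives one character per byte; encoding with unicode_escape
    # then writes control and non-ASCII characters as \n, \xNN and so on.
    return data.decode("latin-1").encode("unicode_escape").decode("ascii")


my_bytestring = b'%PDF-1.4\n%\x93\x8c\x8b\x9e'
text = escaped_text(my_bytestring)

print(text)
print(text == repr(my_bytestring)[2:-1])  # same result for this input
print(list(text))
```

The two approaches match for this example, but they can differ on other inputs. For instance, `repr` escapes a single quote when the data contains both quote characters, which `unicode_escape` does not do.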

edited Apr 26 '18 at 11:10

answered Apr 26 '18 at 10:29

CristiFati


2

don't invent your own `.decode()` use the `unicode_escape` encoding – Apr 26 '18 at 10:33
Thanks. Just changing the str(char) to chr(char) in my code did the job! – Yobmod Apr 26 '18 at 10:33
Thank you @hop, I'll add it. – CristiFati Apr 26 '18 at 10:35
Your links look like they've been inserted by some kind of script. May I ask where I can find this script? It looks useful. – Aran-Fey Apr 26 '18 at 10:39
@Aran-Fey: Unfortunately I didn't have the time to automate it yet, so it's all manual (monkey work) :) . – CristiFati Apr 26 '18 at 10:43

score 1 · Answer 2 · 2018-04-26T10:51:53.560

1

Technically you cannot get from bytes to strings without decoding, but there is a codec that does what you want:

>>> b = b'\xb5\xb5\xb5\xb5\r\n1'
>>> s = b.decode('unicode_escape')
>>> s
'µµµµ\r\n1'
>>> print(s)
µµµµ
1

There is also `raw_unicode_escape`. You can read about the differences in the documentation.
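
As a rough illustration of the difference: when decoding, `unicode_escape` interprets all the usual backslash escapes, whereas `raw_unicode_escape` only interprets `\uXXXX` and `\UXXXXXXXX` and passes every other byte through as Latin-1. The sample bytes below are made up for the demonstration:

```python
# A raw 0xb5 byte, a literal backslash + "n", and a literal \u00b5 sequence.
data = b"\xb5 \\n \\u00b5"

cooked = data.decode("unicode_escape")
raw = data.decode("raw_unicode_escape")

print(repr(cooked))  # backslash-n is turned into a real newline
print(repr(raw))     # backslash-n stays as two characters; \u00b5 is still decoded
```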

I very much doubt that there is a use case for having binary data in a unicode string.

edited Apr 26 '18 at 10:51

answered Apr 26 '18 at 10:35

Doesn't work for the second string. Gives: UnicodeEncodeError: 'charmap' codec can't encode characters in position 11-14: character maps to <undefined> – Yobmod Apr 26 '18 at 10:37
@Yobmod I can't reproduce that. `b'%PDF-1.4\n%\x93\x8c\x8b\x9e'.decode('unicode_escape')` returns `'%PDF-1.4\n%\x93\x8c\x8b\x9e'`. – Aran-Fey Apr 26 '18 at 10:48
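
One possible explanation for the mismatch, which the thread does not confirm: the error is a `UnicodeEncodeError`, so it would come from writing the decoded string to the console, not from `.decode()` itself. The sketch below assumes a console that encodes output as `cp1252`, a common Windows setting, and uses an explicit `.encode()` as a stand-in for `print()`:

```python
s = b'%PDF-1.4\n%\x93\x8c\x8b\x9e'.decode('unicode_escape')  # decoding succeeds

try:
    # Stand-in for print(s) on a cp1252 console (an assumption, see above).
    s.encode('cp1252')
    encode_failed = False
except UnicodeEncodeError as exc:
    encode_failed = True
    print('encode failed:', exc)

# ascii() escapes every non-ASCII character, so its result is printable
# regardless of the console encoding.
print(ascii(s))
```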

score -2 · Answer 3 · answered Jun 03 '23 at 21:34

The PDF payload obviously isn't UTF-8 encoded, nor is it in any other text encoding. It is raw data, not any form of text.

BUT there is an encoding that maintains all the characters with codes from 0 to 255:

data = data.decode("latin1")

This changes the data type from bytes to str.

It isn't a brilliant solution because it creates a new object, which consumes CPU time and memory, but it is the only one.

It is a nuisance there isn't an instruction in Python to just change the data type, from bytes to str, without processing.
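
A short check of the claim that `latin1` keeps every byte value: each of the 256 possible bytes decodes to the code point with the same number, and encoding with the same codec gives the original bytes back. The variable names are chosen for illustration only:

```python
all_bytes = bytes(range(256))

# latin1 decoding does not fail on any byte value.
text = all_bytes.decode("latin1")
same_code_points = [ord(c) for c in text] == list(all_bytes)

# Encoding with the same codec restores the original bytes.
round_trip_ok = text.encode("latin1") == all_bytes

print(same_code_points, round_trip_ok)
```

The round trip matters if the string is only a temporary carrier and the original bytes are needed again later.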
