For example, given an arbitrary string, which could contain characters or just random bytes:

string = '\xf0\x9f\xa4\xb1'

I want to output:

b'\xf0\x9f\xa4\xb1'

This seems so simple, but I could not find an answer anywhere. Of course, just typing the b prefix in front of the string literal will do, but I want to do this at runtime, or from a variable containing the string of bytes.

If the given string were AAAA or some other known characters, I could simply do string.encode('utf-8'), but I am expecting the string of bytes to be random. Doing that to '\xf0\x9f\xa4\xb1' (random bytes) produces the unexpected result b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'.

There must be a simpler way to do this?

Edit:

I want to convert the string to bytes without using an encoding.

sjakobi
AznBoyStride
  • Do you want to convert the string to bytes? It is not clear what the desired solution is... if you know it is a byte string without the b, you can do some string formatting. If you need it in bytes, you can call `bytes(string)`. Does this help: https://stackoverflow.com/questions/606191/convert-bytes-to-a-string ? – Scott Skiles Aug 08 '18 at 20:06
  • Yes I want to simply convert the string to bytes – AznBoyStride Aug 08 '18 at 20:07
  • Okay I see your problem. You might need to use a raw string – Scott Skiles Aug 08 '18 at 20:11
  • The `bytes` function takes in a `string` and an `encoding`. Since the bytes I'm expecting are random, I don't want to pick an encoding for it – AznBoyStride Aug 08 '18 at 20:13

2 Answers

The Latin-1 character encoding trivially (and unlike every other encoding supported by Python) encodes every code point in the range 0x00-0xff to a byte with the same value.

byteobj = '\xf0\x9f\xa4\xb1'.encode('latin-1')
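As a minimal sketch of that property (the variable names here are only illustrative, not from the answer), the following checks that every code point from 0x00 to 0xff encodes to the byte with the same value, and shows that anything above 0xff is rejected rather than silently altered:

```python
# Build a string containing every code point from U+0000 to U+00FF.
all_low_code_points = ''.join(chr(i) for i in range(256))

# Latin-1 maps each of those code points to the single byte with the same value.
encoded = all_low_code_points.encode('latin-1')
print(encoded == bytes(range(256)))  # True

# The string from the question.
byteobj = '\xf0\x9f\xa4\xb1'.encode('latin-1')
print(byteobj)  # b'\xf0\x9f\xa4\xb1'

# Code points above U+00FF have no Latin-1 byte, so encoding raises an error.
try:
    '\u20ac'.encode('latin-1')
except UnicodeEncodeError as exc:
    print('cannot encode:', exc.reason)
```

The conversion is reversible: decoding the result with 'latin-1' gives back the original string.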

You say you don't want to use an encoding, but the alternatives which avoid it seem far inferior.

The UTF-8 encoding is unsuitable because, as you already discovered, code points above 0x7f map to a sequence of multiple bytes (two bytes for code points up to 0x7ff, and up to four bytes for higher ones), none of which is the input code point as a byte value.
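To see where the unexpected result in the question comes from, this short sketch (variable names are illustrative) encodes each character of the example string separately with UTF-8:

```python
text = '\xf0\x9f\xa4\xb1'

# Each code point above 0x7f becomes two bytes under UTF-8.
for ch in text:
    print(hex(ord(ch)), '->', ch.encode('utf-8'))

utf8 = text.encode('utf-8')
print(utf8)                  # b'\xc3\xb0\xc2\x9f\xc2\xa4\xc2\xb1'
print(len(text), len(utf8))  # 4 8
```

Four characters turn into eight bytes, which is exactly the output reported in the question.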

Omitting the argument to .encode() (as in a now-deleted answer) does not help either. In Python 3, str.encode() defaults to UTF-8, so it produces the same multi-byte result as above. (System-dependent default encodings, which are typically UTF-8 on most systems but something much less predictable on Windows, affect calls such as open() without an encoding argument, not str.encode().)

tripleee

I found a working solution:

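import struct

def convert_string_to_bytes(string):
    # Pack each character's code point (must be 0-255) as one unsigned byte.
    result = b''  # avoid naming this "bytes", which would shadow the built-in
    for ch in string:
        result += struct.pack("B", ord(ch))
    return result

string = '\xf0\x9f\xa4\xb1'

print(convert_string_to_bytes(string))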

output: b'\xf0\x9f\xa4\xb1'
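The same idea can probably be written more compactly without struct. This is a sketch of an alternative, not the code from this answer, and str_to_bytes is a hypothetical helper name: the built-in bytes() constructor accepts an iterable of integers in the range 0-255.

```python
def str_to_bytes(text):
    """Hypothetical helper: map each code point (0-255) to the byte with that value."""
    return bytes(ord(ch) for ch in text)

print(str_to_bytes('\xf0\x9f\xa4\xb1'))  # b'\xf0\x9f\xa4\xb1'

# Code points above 255 are not valid byte values, so bytes() raises ValueError.
try:
    str_to_bytes('\u20ac')
except ValueError as exc:
    print('not a byte value:', exc)
```

For input limited to code points 0x00-0xff, this gives the same bytes as the loop above and as .encode('latin-1').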

AznBoyStride
  • b'\'\\x1e\\x03\\xcd\\xb6\\x93:\\x87\\xfc\\xcfp\\xfc\\xb7\\xba\\x8a\\x0es\\x81P\\xe1\\x1b\\n4a\\xe4"\\xdfA\\x8e\\x8a\\x15\\x18\\xb8\\x12\\xfcB/\\xea\\x83\\xd4\\x1dd\\xb8\\x14\\xd3\\xb9\\xfa\\x97B\\xfe\\x89\\xe1\\xff\\xbe\\x02\\xedY\\xc9pk\\\'\\xf8\\x1d9\\x1a\'' output is like this – Sadique Khan Nov 11 '21 at 08:23