
TLDR version:

I know that in Python 3 a `str` holds Unicode text by default, whereas in Python 2 (2.6+) a `str` is a byte sequence.

# test.py

a = "\xEF\xEB"
print(a)

$ python test.py | hexdump -C
00000000  ef eb 0a

$ python3 test.py | hexdump -C
00000000  c3 af c3 ab 0a

I need to make the strings in Python 3 code to be exactly like the ones in Python 2 (i.e., containing the exact bytes as the original string without the Unicode conversion).
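
One way to get the Python 2 output from Python 3, sketched here under the assumption that the data is kept as a `bytes` literal rather than a `str` and written through the binary layer of stdout instead of `print`. The helper name `write_raw` is made up for illustration and is not part of any library:

```python
import io
import sys


def write_raw(data, stream=None):
    """Write bytes unchanged to a binary stream.

    Defaults to sys.stdout.buffer, the binary layer underneath sys.stdout.
    (write_raw is a hypothetical helper name used only in this sketch.)
    """
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(data)
    stream.flush()


a = b"\xEF\xEB"  # bytes literal: exactly two bytes, 0xEF and 0xEB

# Demonstrate with an in-memory binary stream so the result can be inspected.
captured = io.BytesIO()
write_raw(a + b"\n", captured)
print(captured.getvalue().hex())  # efeb0a
```

Calling `write_raw(a + b"\n")` with no stream and piping the script into `hexdump -C` should then show `ef eb 0a`, since no text encoding is applied to a `bytes` object.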


Longer version:

I was in the process of migrating some web server code from Python 2 to 3 and encountered a significant but hopefully easy-to-solve problem. As an example:

# test.py

from struct import pack
port = pack( '>H', 5100 ).decode( 'ISO-8859-1' )
print(port)

$ python3 test.py | hexdump -C
00000000  13 c3 ac 0a 

# test.py

from struct import pack
port = pack( '>H', 5100 )
print(port)

$ python test.py | hexdump -C
00000000  13 ec 0a

The Unicode format in which Python 3 stores strings is causing a huge problem for my apps because they were written with predetermined offsets (i.e., certain bytes are expected to be at certain places) and those offsets are now thrown off by the Unicode characters.

Is there a way to convert a Python 3 string into a "regular" string like what we had in Python 2 so that a string "\xEF\xEB" will be treated as exactly that?
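
For what it's worth, the shifted offsets can be reproduced without `print` at all. The sketch below assumes stdout is encoded as UTF-8 (which the `c3 ac` in the hexdump suggests, but that depends on the terminal or locale). The decoded string still has two characters; the extra byte only appears when the text is encoded again as UTF-8, and encoding with ISO-8859-1 gives back the original bytes:

```python
from struct import pack

raw = pack(">H", 5100)             # two bytes: 0x13 0xEC
text = raw.decode("ISO-8859-1")    # two code points: U+0013 and U+00EC

# What print() effectively emits when stdout is UTF-8 (an assumption here):
as_utf8 = text.encode("utf-8")     # U+00EC becomes the two bytes 0xC3 0xAC

# ISO-8859-1 maps code points 0-255 one-to-one onto bytes, so it round-trips:
as_latin1 = text.encode("ISO-8859-1")

print(len(raw), len(text), len(as_utf8), len(as_latin1))  # 2 2 3 2
print(as_utf8.hex(), as_latin1.hex())                     # 13c3ac 13ec
```

So the string itself is not the problem; the offsets move at the point where text is turned back into bytes with a multi-byte encoding.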

AK-33
  • Certain bytes are supposed to be in certain places? Sounds like you handle stdout incorrectly. You can send bytes bypassing stdout's encoding but I wouldn't suggest doing it, and instead suggest fixing your original apps. – Bharel May 13 '22 at 19:31
  • That's not it. I can't go into details, but it involves memory dump manipulation. – AK-33 May 13 '22 at 19:34
  • "I need to make the strings in Python 3 code to be exactly like the ones in Python 2 (i.e., containing the exact bytes as the original string without the Unicode conversion)." Strings don't contain bytes. The *program output* does, since all data is composed of bytes. Strings, however, are an abstraction. To write specific bytes to the standard output, please see the linked duplicate. "Is there a way to convert a Python 3 string into a "regular" string" Python 3's strings **are** "regular strings" in every meaningful sense. Python 2's treatment of byte-sequences as "strings" was a hack. – Karl Knechtel May 13 '22 at 20:10
  • But the way to *represent* a raw *sequence of bytes* (without heed to interpretation as text, or any encoding used for that purpose) is to use the built-in `bytes` type, which you can create using a *bytes literal* (prefixing the string literal with `b`). Because this *does not in any way represent text*, the rules for escape sequences are slightly different. – Karl Knechtel May 13 '22 at 20:11
  • As an aside: writing raw bytes to a network socket will generally be easier than writing a string, and writing raw bytes to a file just requires opening it in binary mode. `print`, however, is designed with the expectation of text in mind. – Karl Knechtel May 13 '22 at 20:13
  • Oh, right: note that the bytes passed to `hexdump` from the Python 3 program won't necessarily always be as you've shown. Python is translating the text in the string to the bytes *that your terminal program* expects for the purpose of representing the text - so that it can, in turn, decode that encoding again, look up the glyphs, and light up the appropriate pixels in the window. The translation will depend on the terminal's expectation (and on Python being properly configured for that expectation!). This *usually* works seamlessly, but historically a lot of Windows users have had issues. – Karl Knechtel May 13 '22 at 20:20
  • Thanks, Karl. One aspect of my question I should've mentioned is that the source of my problem involves concatenation. In Python 2, strings and raw bytes could be concatenated, so I could do something like "AB" + struct.pack( '>H', 5100) + "CD." Because strings and bytes are not compatible in Python 3, the bytes must be decoded, the result of which is in Unicode format. I'll conduct further tests with Wireshark. – AK-33 May 13 '22 at 20:21
  • (Also, even when the encoding is set up properly, [weird stuff can happen](https://superuser.com/questions/1720776/). Text is a [**hard** problem](https://www.unicode.org/reports/tr9/).) – Karl Knechtel May 13 '22 at 20:22
  • "Because strings and bytes are not compatible in Python 3, the bytes must be decoded, the result of which is in Unicode format." Instead, encode the strings, since you want a bytes result at the end anyway. Two bytes objects may be concatenated to each other, as is usual for sequences. But in any event, make sure you **think** about what encoding is used, and why; and about why there are any strings involved in the first place. – Karl Knechtel May 13 '22 at 20:23
  • "Instead, encode the strings ..." - That may just work. My code involves building a data packet, converting certain bytes into numbers and doing calculations on those numbers. Because of the string/bytes compatibility in Python 2, this was seamless. The encoding/decoding may throw off the calculations, but I'll test it out. Thanks again for the detailed responses! – AK-33 May 13 '22 at 20:37
  • It's seamless in Python 3 as well. If dealing with bytes use `bytes` objects: `b"AB" + struct.pack( '>H', 5100) + b"CD."`. Realize that Python 2 `str` == Python 3 `bytes`. Python 2 `unicode` == Python 3 `str`. Use Unicode strings for text, byte strings for data. – Mark Tolonen May 13 '22 at 21:46
  • I think I may have seriously over-complicated my problem. Thanks, everyone! – AK-33 May 14 '22 at 03:19
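
Pulling the comments together, a minimal sketch of the bytes-only approach might look like the following. The packet layout (`AB`, a big-endian 16-bit port, `CD`) and the function name `build_packet` are invented for illustration and are not taken from the original application:

```python
import struct


def build_packet(port):
    """Concatenate bytes objects only; no str is involved at any point.

    The AB/port/CD layout is a made-up example, not the real packet format.
    """
    return b"AB" + struct.pack(">H", port) + b"CD"


packet = build_packet(5100)

# Offsets are stable: one element per byte.
print(len(packet))   # 6
print(packet.hex())  # 414213ec4344

# Indexing a bytes object yields an int, so arithmetic needs no ord():
high_byte = packet[2]  # 0x13 == 19
low_byte = packet[3]   # 0xEC == 236
print(high_byte * 256 + low_byte)  # 5100

# struct.unpack_from reads the same field directly at offset 2:
(port,) = struct.unpack_from(">H", packet, 2)
print(port)  # 5100

# If some part really starts life as text, encode it explicitly first:
header = "AB".encode("ascii")
print(header + packet[2:] == packet)  # True
```

One difference from Python 2 worth checking during the migration: `packet[2]` is already an integer in Python 3, so any leftover `ord(packet[2])` calls will raise a `TypeError`.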
