Fixed-length encoding in Python 3

Question

I am currently working on an encryption/decryption program in Python 3, and it works fine with strings; however, I am having problems converting it to use byte strings, because in UTF-8 a character can be expressed in anywhere from 1 to 4 bytes.

>>> '\u0123'.encode('utf-8')
b'\xc4\xa3'
>>> '\uffff'.encode('utf-8')
b'\xef\xbf\xbf'

After some research, I found out that there is currently no encoding in Python 3 that uses a fixed number of bytes for every character and covers all the characters UTF-8 can express. Is there any module/function that I can use to get around this problem (for example, by attaching empty bytes so that each character encodes to a byte string of length 4)?

UTF-8 is a variable-length encoding. So no, there is no encoding *anywhere in the world* that is both fixed length and UTF-8. — Martijn Pieters, Nov 18 '15 at 18:16
UTF-16 doesn't work - '\u0123'.encode('utf-16') gives b'\xff\xfe#\x01' and '\uffff'.encode('utf-16') gives b'\xff\xfe\xff\xff'. What am I doing terribly wrong? — Vladimir Shevyakov, Nov 18 '15 at 18:19
@VladimirShevyakov: that's the BOM; it is always included in a UTF-16 encoding. — Martijn Pieters, Nov 18 '15 at 18:20
For those who may need it, I have implemented a [fixed-length encoding of an integer](https://stackoverflow.com/a/54152763/832230), not of a string. — Asclepius, Jan 12 '19 at 19:27

Martijn Pieters · Accepted Answer · 2015-11-18T18:37:11.263

UTF-8 is an encoding that will always use a variable number of bytes (1 to 4 per character); how many depends on the Unicode codepoints of the input text.

If you need a fixed length encoding that can handle Unicode, use UTF-32 (UTF-16 still uses either 2 or 4 bytes per codepoint).
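
As a quick check of that difference, here is a minimal sketch that counts the bytes each codec needs per character. The sample characters are picked for illustration only, and the BOM-free `-le` codec variants are used so that only the character bytes are counted.

```python
# Sample characters chosen for illustration: ASCII, a 2-byte and a 3-byte
# UTF-8 character, and one codepoint outside the Basic Multilingual Plane.
samples = ['A', '\u0123', '\uffff', '\U0001F600']

for ch in samples:
    # Byte count of this single character under each codec (no BOM).
    sizes = {codec: len(ch.encode(codec))
             for codec in ('utf-8', 'utf-16-le', 'utf-32-le')}
    print(f'U+{ord(ch):04X}', sizes)

# Collect the distinct UTF-32 sizes seen; a fixed-length encoding yields one.
utf32_sizes = {len(ch.encode('utf-32-le')) for ch in samples}
print(utf32_sizes)
```

For these samples the UTF-8 and UTF-16 sizes vary from character to character, while every UTF-32 size is 4.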

Note that Python's utf-16 and utf-32 codecs both prepend a Byte Order Mark: an initial U+FEFF ZERO WIDTH NO-BREAK SPACE codepoint that lets a decoder know whether the bytes were produced in little- or big-endian order. The BOM always takes 4 bytes in UTF-32, so your output is going to be 4 + (4 * character count) bytes long.
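
The length arithmetic can be checked with a short sketch. It assumes only the standard `codecs` module, whose `BOM_UTF32_LE` and `BOM_UTF32_BE` constants hold the two possible UTF-32 byte order marks; the variable names are illustrative.

```python
import codecs

text = 'Hello world'
encoded = text.encode('utf-32')  # native byte order, BOM prepended

bom = encoded[:4]      # the first 4 bytes are the byte order mark
payload = encoded[4:]  # the rest is 4 bytes per character

print(bom in (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE))  # True
print(len(encoded), 4 + 4 * len(text))                    # 48 48
```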

You can encode to a specific byte order by adding -le or -be to the codec name (for example utf-32-le), in which case the BOM is omitted:

>>> 'Hello world'.encode('utf-32')
b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'
>>> 'Hello world'.encode('utf-32-le')
b'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'
>>> 'Hello world'.encode('utf-32-be')
b'\x00\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d'
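
Since every character occupies exactly 4 bytes in the BOM-free output, the encoded data can be cut into per-character blocks and joined back together. The following is only a sketch of that idea: `to_fixed_chunks` and `from_fixed_chunks` are hypothetical helper names, not part of any library, and `utf-32-le` is an arbitrary choice of byte order.

```python
def to_fixed_chunks(text, codec='utf-32-le'):
    """Encode text and split it into one 4-byte block per character."""
    data = text.encode(codec)
    return [data[i:i + 4] for i in range(0, len(data), 4)]


def from_fixed_chunks(chunks, codec='utf-32-le'):
    """Join 4-byte blocks and decode them back to a str."""
    return b''.join(chunks).decode(codec)


chunks = to_fixed_chunks('\u0123\uffff')
print(chunks)                     # [b'#\x01\x00\x00', b'\xff\xff\x00\x00']
print(from_fixed_chunks(chunks) == '\u0123\uffff')  # True
```

Both sides have to agree on the byte order, because without a BOM the decoder has no way to detect it.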

'\uffff'.encode('utf-32') gives b'\xff\xfe\x00\x00\xff\xff\x00\x00' and '\u0123'.encode('utf-32') gives b'\xff\xfe\x00\x00#\x01\x00\x00'. What does the # do? — Vladimir Shevyakov, Nov 18 '15 at 18:20
@VladimirShevyakov: again, that's the BOM being included. I'll update (but be patient, on a train and the connectivity is variable). — Martijn Pieters, Nov 18 '15 at 18:21
