
I am currently working on an encryption/decryption program in Python 3. It works fine with strings, but I am having problems converting it to use byte strings, because in UTF-8 a character can be encoded as anywhere from 1 to 4 bytes.

>>> '\u0123'.encode('utf-8')
b'\xc4\xa3'
>>> '\uffff'.encode('utf-8')
b'\xef\xbf\xbf'

After some research, I found that UTF-8 never uses a fixed number of bytes per character, and I could not find an encoding in Python 3 that is fixed-length and still covers every character UTF-8 can represent. Is there a module or function I can use to get around this problem (for example, by padding with empty bytes so that each character encodes to a byte string of length 4)?

Martijn Pieters
Vladimir Shevyakov

1 Answer


UTF-8 is an encoding that will always use a variable number of bytes; how many depends on the Unicode codepoints of the input text.

If you need a fixed length encoding that can handle Unicode, use UTF-32 (UTF-16 still uses either 2 or 4 bytes per codepoint).
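To make the difference concrete, here is a minimal sketch that counts the bytes each character occupies under the three encodings. The helper name `encoded_widths` and the sample string are only illustrative; the `-le` codec variants are used so that no BOM is counted.

```python
def encoded_widths(text, encoding):
    """Return the number of bytes each character of `text` takes in `encoding`."""
    return [len(ch.encode(encoding)) for ch in text]


# One character from each UTF-8 length class: U+0061, U+0123, U+FFFF, U+1F600
sample = "a\u0123\uffff\U0001F600"

print(encoded_widths(sample, "utf-8"))      # [1, 2, 3, 4]
print(encoded_widths(sample, "utf-16-le"))  # [2, 2, 2, 4]
print(encoded_widths(sample, "utf-32-le"))  # [4, 4, 4, 4]
```

Only UTF-32 gives the same width for every codepoint.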

Note that Python's utf-16 and utf-32 codecs both prepend a Byte Order Mark (BOM): an initial U+FEFF ZERO WIDTH NO-BREAK SPACE codepoint that lets a decoder know whether the bytes were produced in little- or big-endian order. In UTF-32 the BOM always takes 4 bytes, so your output is going to be 4 + (4 * character count) bytes long.
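As a quick check of that length formula, the following sketch compares the output of the plain `utf-32` codec against `codecs.BOM_UTF32`, the BOM constant for the machine's native byte order. The variable names are illustrative.

```python
import codecs

text = "Hello world"
data = text.encode("utf-32")  # native byte order, BOM prepended

# The first 4 bytes are the BOM in the platform's native byte order.
has_bom = data[:4] == codecs.BOM_UTF32
print(has_bom)    # True

# 4 bytes of BOM plus 4 bytes per character.
print(len(data))  # 48
```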

You can encode to a specific byte order by adding -le or -be to the codec, in which case the BOM is omitted:

>>> 'Hello world'.encode('utf-32')
b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'
>>> 'Hello world'.encode('utf-32-le')
b'H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00'
>>> 'Hello world'.encode('utf-32-be')
b'\x00\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d'
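For the fixed-width use case in the question, a BOM-free codec means every 4-byte slice is exactly one character. The sketch below assumes little-endian order was chosen; `to_fixed_width` and `codepoints` are hypothetical helper names, not part of any library.

```python
def to_fixed_width(text):
    """Encode text so that every character takes exactly 4 bytes (no BOM)."""
    return text.encode("utf-32-le")


def codepoints(data):
    """Split fixed-width data into 4-byte chunks and return the codepoints."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4")
    return [int.from_bytes(data[i:i + 4], "little")
            for i in range(0, len(data), 4)]


data = to_fixed_width("\u0123")
print(data)                              # b'#\x01\x00\x00'
print([hex(cp) for cp in codepoints(data)])  # ['0x123']
print(data.decode("utf-32-le") == "\u0123")  # True
```

Bytes that happen to fall in the printable ASCII range are displayed as that character in a bytes repr, so the byte 0x23 shows up as `#` here.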
Martijn Pieters
  • '\uffff'.encode('utf-32') gives b'\xff\xfe\x00\x00\xff\xff\x00\x00' and '\u0123'.encode('utf-32') gives b'\xff\xfe\x00\x00#\x01\x00\x00'. What does the # do? – Vladimir Shevyakov Nov 18 '15 at 18:20
  • @VladimirShevyakov: again, that's the BOM being included. I'll update (but be patient, on a train and the connectivity is variable). – Martijn Pieters Nov 18 '15 at 18:21