
Given a character, how can we transform its UTF-8 encoding to bits in Python?

As an example, `a` corresponds to `01100001`. I am aware of `ord`, but something like `bin(ord('a'))[2:]` returns `1100001`, without the leading zero. Of course, I can pad it to 8 bits with `zfill(8)`, but I would like to know whether there is a more Pythonic way of doing this. For instance, if we do not know in advance how many bits are required, the `zfill(8)` approach may no longer work, since the result may be 16 bits long.

Josh
  • What behaviour are you expecting if "we do not know in-advance how many bits it requires"? – Abhinav Mathur Mar 14 '22 at 05:10
  • You can use f-string `f'{6:08b}'` https://stackoverflow.com/a/10411108/9050514 – deadshot Mar 14 '22 at 05:11
  • A string that could contain any permissible UTF-8 character. It may be 8 bits long or more. – Josh Mar 14 '22 at 05:11
  • `ord` and UTF-8 are quite different things, which one do you really want? – Kelly Bundy Mar 14 '22 at 05:16
  • @deadshot It does not work as expected if the character is more than 8 bits long. Consider `Ȩ`. `ord('Ȩ')` returns 552. In binary it is `1000101000`, which is 10 bits, but I want it to return 16 bits with zeros to the left as needed. `f'{552:08b}'` returns `1000101000`, which is again 10 bits long, not 16. – Josh Mar 14 '22 at 05:23
  • @KellyBundy I want the output of `ord` as a multiple of 8 in binary. I was under the assumption that `ord` returns the decimal value of UTF-8. If that is not correct, then I had the wrong assumption. But my question is specifically related to the output of `ord`. If necessary, I can edit the question and remove any mention to UTF-8. – Josh Mar 14 '22 at 05:28
  • @Josh Decimal? That's a third different thing. But yes, I'd say if you want the ord thing, you'd better remove UTF-8. – Kelly Bundy Mar 14 '22 at 05:57
  • Then again, that zfill stuff and saying "may be 16 bits long" makes no sense for what ord does. – Kelly Bundy Mar 14 '22 at 06:04
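
The distinction raised in these comments can be illustrated with a short sketch. It is not part of the original thread; it only assumes the character `Ȩ` (U+0228) from the comment above and uses the built-ins `ord`, `str.encode`, and `format`. The variable names are chosen here for illustration.

```python
ch = '\u0228'  # the character Ȩ from the comment above

# ord() gives the Unicode code point as an integer.
code_point = ord(ch)
code_point_bits = format(code_point, 'b')

# str.encode() gives the UTF-8 byte sequence, which is a different thing.
utf8_bytes = ch.encode('utf-8')
utf8_bits = ' '.join(format(b, '08b') for b in utf8_bytes)

print(code_point, code_point_bits)   # 552 1000101000
print(utf8_bytes.hex(), utf8_bits)   # c8a8 11001000 10101000
```

The code point needs 10 bits, while its UTF-8 encoding occupies two full bytes with a different bit pattern, so padding the output of `ord` does not produce the UTF-8 bits.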

1 Answer

Python 3 strings contain Unicode code points, not "UTF-8 characters". You can use `ord()` to get a character's Unicode code point value, and `.encode()` to convert the string to UTF-8 bytes. Then format each byte as 8-digit binary text and `.join()` the results. Example:

# starting and ending code points for 1-, 2-, 3- and 4-byte UTF-8.
s1 = '\x00\x7f\x80\u07ff\u0800\uffff\U00010000\U0010FFFF'

# some printable characters in each range
s2 = 'Aü马🎂'

def utf8_bin(u):
    # format as 8-digit binary, join each byte with space
    return ' '.join([f'{i:08b}' for i in u.encode()])

for u in s1:
    col1 = f'U+{ord(u):04X}' # format Unicode codepoint, leading zeros if <4 digits.
    print(f'{col1:8} {utf8_bin(u)}')

print()

for u in s2:
    col1 = f'U+{ord(u):04X}'
    print(f'{col1:8} {u} {utf8_bin(u)}')

Output:

U+0000   00000000
U+007F   01111111
U+0080   11000010 10000000
U+07FF   11011111 10111111
U+0800   11100000 10100000 10000000
U+FFFF   11101111 10111111 10111111
U+10000  11110000 10010000 10000000 10000000
U+10FFFF 11110100 10001111 10111111 10111111

U+0041   A 01000001
U+00FC   ü 11000011 10111100
U+9A6C   马 11101001 10101001 10101100
U+1F382  🎂 11110000 10011111 10001110 10000010
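
If what is wanted is instead the variant described in the question's comments, that is, the `ord` value itself padded to a multiple of 8 bits, a minimal sketch could look like the following. The helper name `codepoint_bits` is hypothetical and not from the original answer; the result is the zero-padded code point, not UTF-8.

```python
def codepoint_bits(ch):
    """Return ord(ch) in binary, zero-padded to a whole number of bytes."""
    value = ord(ch)
    # Bytes needed to hold the value; at least one so that U+0000 still prints.
    n_bytes = max(1, (value.bit_length() + 7) // 8)
    return format(value, f'0{n_bytes * 8}b')

print(codepoint_bits('a'))       # 01100001
print(codepoint_bits('\u0228'))  # 0000001000101000
```

Here `int.bit_length()` determines how many bits the code point needs, so the width does not have to be known in advance.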
Mark Tolonen