Python 3 strings contain Unicode code points, not "UTF-8 characters". You can use ord()
to get the Unicode code point value, and .encode()
to convert it to UTF-8 bytes. Then format each byte as 8-digit binary text, and .join()
them together. Example:
# starting and ending code points for 1-, 2-, 3- and 4-byte UTF-8.
s1 = '\x00\x7f\x80\u07ff\u0800\uffff\U00010000\U0010FFFF'
# some printable characters in each range
s2 = 'Aü马'
def utf8_bin(u):
# format as 8-digit binary, join each byte with space
return ' '.join([f'{i:08b}' for i in u.encode()])
for u in s1:
col1 = f'U+{ord(u):04X}' # format Unicode codepoint, leading zeros if <4 digits.
print(f'{col1:8} {utf8_bin(u)}')
print()
for u in s2:
col1 = f'U+{ord(u):04X}'
print(f'{col1:8} {u} {utf8_bin(u)}')
Output:
U+0000 00000000
U+007F 01111111
U+0080 11000010 10000000
U+07FF 11011111 10111111
U+0800 11100000 10100000 10000000
U+FFFF 11101111 10111111 10111111
U+10000 11110000 10010000 10000000 10000000
U+10FFFF 11110100 10001111 10111111 10111111
U+0041 A 01000001
U+00FC ü 11000011 10111100
U+9A6C 马 11101001 10101001 10101100
U+1F382 11110000 10011111 10001110 10000010