It's Unicode normalization.
For demonstration, I used the script from my former answer (with all the parse.quote and parse.unquote junk omitted).
encodeuni.py 서울용마초등학교
raw 서울용마초등학교 8 \uc11c\uc6b8\uc6a9\ub9c8\ucd08\ub4f1\ud559\uad50
NFC 서울용마초등학교 8 \uc11c\uc6b8\uc6a9\ub9c8\ucd08\ub4f1\ud559\uad50
NFKC 서울용마초등학교 8 \uc11c\uc6b8\uc6a9\ub9c8\ucd08\ub4f1\ud559\uad50
NFD 서울용마초등학교 20 \u1109\u1165\u110b\u116e\u11af\u110b\u116d\u11bc\u1106\u1161\u110e\u1169\u1103\u1173\u11bc\u1112\u1161\u11a8\u1100\u116d
NFKD 서울용마초등학교 20 \u1109\u1165\u110b\u116e\u11af\u110b\u116d\u11bc\u1106\u1161\u110e\u1169\u1103\u1173\u11bc\u1112\u1161\u11a8\u1100\u116d
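
The length difference shown above (8 vs. 20) can be checked in a few lines. This is only a minimal sketch using the standard library; the variable names are mine, and the string is written with escapes so that its form is unambiguous:

```python
from unicodedata import normalize

# The same text as above, written as 8 precomposed Hangul syllables.
s = '\uc11c\uc6b8\uc6a9\ub9c8\ucd08\ub4f1\ud559\uad50'

nfc = normalize('NFC', s)   # composed form: one code point per syllable
nfd = normalize('NFD', s)   # decomposed form: one code point per jamo

print(len(nfc), len(nfd))            # 8 20
print(nfc == nfd)                    # False: different code point sequences
print(normalize('NFC', nfd) == nfc)  # True once both sides use the same form
```

Both strings look identical on screen, so a plain `==` comparison fails silently unless both sides are normalized to the same form first.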

The characters are listed below. Note that the CodePoint column contains the Unicode code point (U+hhhh) and the UTF-8 bytes.
NFC, NFKC:
Char CodePoint Category Description
---- --------- -------- -----------
서 {U+C11C, 0xEC,0x84,0x9C} OtherLetter Hangul Syllable Sios Eo
울 {U+C6B8, 0xEC,0x9A,0xB8} OtherLetter Hangul Syllable Ieung U Rieul
용 {U+C6A9, 0xEC,0x9A,0xA9} OtherLetter Hangul Syllable Ieung Yo Ieung
마 {U+B9C8, 0xEB,0xA7,0x88} OtherLetter Hangul Syllable Mieum A
초 {U+CD08, 0xEC,0xB4,0x88} OtherLetter Hangul Syllable Chieuch O
등 {U+B4F1, 0xEB,0x93,0xB1} OtherLetter Hangul Syllable Tikeut Eu Ieung
학 {U+D559, 0xED,0x95,0x99} OtherLetter Hangul Syllable Hieuh A Kiyeok
교 {U+AD50, 0xEA,0xB5,0x90} OtherLetter Hangul Syllable Kiyeok Yo
NFD, NFKD:
Char CodePoint Category Description
---- --------- -------- -----------
ᄉ {U+1109, 0xE1,0x84,0x89} OtherLetter Hangul Choseong Sios
ᅥ {U+1165, 0xE1,0x85,0xA5} OtherLetter Hangul Jungseong Eo
ᄋ {U+110B, 0xE1,0x84,0x8B} OtherLetter Hangul Choseong Ieung
ᅮ {U+116E, 0xE1,0x85,0xAE} OtherLetter Hangul Jungseong U
ᆯ {U+11AF, 0xE1,0x86,0xAF} OtherLetter Hangul Jongseong Rieul
ᄋ {U+110B, 0xE1,0x84,0x8B} OtherLetter Hangul Choseong Ieung
ᅭ {U+116D, 0xE1,0x85,0xAD} OtherLetter Hangul Jungseong Yo
ᆼ {U+11BC, 0xE1,0x86,0xBC} OtherLetter Hangul Jongseong Ieung
ᄆ {U+1106, 0xE1,0x84,0x86} OtherLetter Hangul Choseong Mieum
ᅡ {U+1161, 0xE1,0x85,0xA1} OtherLetter Hangul Jungseong A
ᄎ {U+110E, 0xE1,0x84,0x8E} OtherLetter Hangul Choseong Chieuch
ᅩ {U+1169, 0xE1,0x85,0xA9} OtherLetter Hangul Jungseong O
ᄃ {U+1103, 0xE1,0x84,0x83} OtherLetter Hangul Choseong Tikeut
ᅳ {U+1173, 0xE1,0x85,0xB3} OtherLetter Hangul Jungseong Eu
ᆼ {U+11BC, 0xE1,0x86,0xBC} OtherLetter Hangul Jongseong Ieung
ᄒ {U+1112, 0xE1,0x84,0x92} OtherLetter Hangul Choseong Hieuh
ᅡ {U+1161, 0xE1,0x85,0xA1} OtherLetter Hangul Jungseong A
ᆨ {U+11A8, 0xE1,0x86,0xA8} OtherLetter Hangul Jongseong Kiyeok
ᄀ {U+1100, 0xE1,0x84,0x80} OtherLetter Hangul Choseong Kiyeok
ᅭ {U+116D, 0xE1,0x85,0xAD} OtherLetter Hangul Jungseong Yo
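
The two tables are related by plain arithmetic: the Unicode Standard defines Hangul syllable decomposition algorithmically rather than through a lookup table. The following is a hedged sketch of that arithmetic; the constants are the ones given in the standard, while the function name `decompose_syllable` is my own and not part of any library:

```python
from unicodedata import normalize

# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def decompose_syllable(ch):
    """Split one precomposed Hangul syllable into its conjoining jamo."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError(f'not a precomposed Hangul syllable: {ch!r}')
    lead = L_BASE + s_index // (V_COUNT * T_COUNT)               # choseong
    vowel = V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT  # jungseong
    trail = s_index % T_COUNT                                    # 0 means no jongseong
    jamo = chr(lead) + chr(vowel)
    if trail:
        jamo += chr(T_BASE + trail)
    return jamo

# U+C6B8 splits into U+110B U+116E U+11AF, matching the NFD table.
print([f'U+{ord(c):04X}' for c in decompose_syllable('\uc6b8')])
print(decompose_syllable('\uc6b8') == normalize('NFD', '\uc6b8'))  # True
```

Syllables without a final consonant (such as U+C11C) yield two jamo and the others yield three, which is why the 8 syllables here become 20 code points.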
The script:
import sys
from unicodedata import normalize

def encodeuni(s):
    '''
    Returns input string encoded to escape sequences as in a string literal.
    Output is similar to
      str(s.encode('unicode_escape')).lstrip('b').strip("'").replace('\\\\','\\');
    but even every ASCII character is encoded as a \\xNN escape sequence
    (except a space character). For instance:
      s = 'A á ř 🌈';
      encodeuni(s); # '\\x41 \\xe1 \\u0159 \\U0001f308' while
      str(s.encode('unicode_escape')).lstrip('b').strip("'").replace('\\\\','\\');
      #             # 'A \\xe1 \\u0159 \\U0001f308'
    '''
    def encodechar(ch):
        ordch = ord(ch)
        return ( ch if ordch == 0x20 else
                 f"\\x{ordch:02x}" if ordch <= 0xFF else
                 f"\\u{ordch:04x}" if ordch <= 0xFFFF else
                 f"\\U{ordch:08x}" )
    return ''.join([encodechar(ch) for ch in s])

if len(sys.argv) >= 2 and sys.argv[1] != '':
    letters = (' '.join(
        [sys.argv[i] for i in range(1,len(sys.argv))])).strip()
    # .\SO\59979037.py ÅÅÅ
else:
    letters = '\u212B \u00C5 \u0041\u030A \U0001f308'
    # \u212B Å Angstrom Sign
    # \u00C5 Å Latin Capital Letter A With Ring Above
    # \u0041 A Latin Capital Letter A
    # \u030A ̊ Combining Ring Above
    # \U0001f308 🌈 Rainbow

print('\t'.join( ['raw' ,
                  letters.ljust(10),
                  str(len(letters)),
                  encodeuni(letters),'\n']))
for form in ['NFC','NFKC','NFD','NFKD']:
    letnorm = normalize(form, letters)
    print( '\t'.join( [form,
                       letnorm.ljust(10),
                       str(len(letnorm)),
                       encodeuni(letnorm)]))
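
The script above does not print the per-character tables shown earlier. A similar listing can be sketched with `unicodedata`; the helper name `describe` is hypothetical. Note that Python reports the general category as its two-letter code (`Lo` rather than OtherLetter) and uses the standard's character names, so the text will not match those tables exactly:

```python
from unicodedata import category, name, normalize

def describe(text):
    """Return one row per character: char, code point, UTF-8 bytes, category, name."""
    rows = []
    for ch in text:
        utf8 = ','.join(f'0x{b:02X}' for b in ch.encode('utf-8'))
        rows.append(f'{ch} {{U+{ord(ch):04X}, {utf8}}} '
                    f'{category(ch)} {name(ch, "<unnamed>")}')
    return rows

# List the decomposed (NFD) form of the first two syllables.
for row in describe(normalize('NFD', '\uc11c\uc6b8')):
    print(row)
```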