Normalizing one (umlaut) character results in two

Asked Jan 09 '23 at 16:08

Active Jan 09 '23 at 16:08

Viewed 30 times

I assume I haven't fully understood the unicodedata.normalize() function in Python.

    from unicodedata import normalize

    result = normalize('NFKD', 'Ä')
    print(result)  # 'A'
    print(len(result))  # 2
    print(result == 'A')  # False
    print(result[0] == 'A')  # True

I'm confused about why len() returns 2 instead of 1.

asked Jan 09 '23 at 16:08

buhtz

Judging by other comments (https://stackoverflow.com/questions/16467479/normalizing-unicode), it seems to have been decomposed into the letter A and the umlaut mark separately; the second character should be the umlaut. – Enrico Agrusti Jan 09 '23 at 16:16
`print(result)` output `Ä` for me. – chepner Jan 09 '23 at 16:16
You decomposed the character (see https://unicode.org/reports/tr15/#Norm_Forms); you probably want this: https://stackoverflow.com/a/2633310/10513287 – ivvija Jan 09 '23 at 16:17
`result` should consist of U+0041 and U+0308. It's not clear why `print(result)` only displays the first one. – chepner Jan 09 '23 at 16:19
Check the code page of your terminal. Try `chcp 65001` – Thomas Weller Jan 09 '23 at 16:26
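
To make the comments above concrete, here is a minimal sketch that inspects what `normalize('NFKD', 'Ä')` returns and shows two ways to get back to a single character. The variable names (`original`, `decomposed`, `recomposed`, `base_only`) are only illustrative, and whether recomposing or dropping the mark is the right choice depends on what you actually want to do with the string.

```python
import unicodedata

original = "\u00c4"  # 'Ä' as one precomposed code point
decomposed = unicodedata.normalize("NFKD", original)

# Show every code point in the result with its Unicode name.
# A terminal that cannot render combining marks may display only the 'A'.
for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# NFC composes base letter + combining mark back into one code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(recomposed))

# Alternatively, drop the combining marks and keep only the base letter.
# unicodedata.combining() returns 0 for non-combining characters.
base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
print(base_only == "A")
```

So `len(result)` is 2 because the decomposed form stores the base letter and the diaeresis as separate code points, even when they are drawn as a single glyph.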
