
I don't think I fully understand the unicodedata.normalize() function in Python.

from unicodedata import normalize

result = normalize('NFKD', 'Ä')
print(result)  # 'A'
print(len(result))  # 2
print(result == 'A')  # False
print(result[0] == 'A')  # True

I'm confused about why len() returns 2 instead of 1.

buhtz
  • Judging by other comments (https://stackoverflow.com/questions/16467479/normalizing-unicode), it seems to have been decomposed into the letter A and the umlaut separately; the second character should be the umlaut. – Enrico Agrusti Jan 09 '23 at 16:16
  • `print(result)` outputs `Ä` for me. – chepner Jan 09 '23 at 16:16
  • You decomposed the character (see https://unicode.org/reports/tr15/#Norm_Forms); you probably want this: https://stackoverflow.com/a/2633310/10513287 – ivvija Jan 09 '23 at 16:17
  • `result` should consist of U+0041 and U+0308. It's not clear why `print(result)` only displays the first one. – chepner Jan 09 '23 at 16:19
  • Check the code page of your terminal. Try `chcp 65001` – Thomas Weller Jan 09 '23 at 16:26
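
To make the point from the comments concrete, here is a minimal sketch that lists the code points in the normalized string. It assumes the input is the precomposed character U+00C4, written as an escape so the source file's encoding cannot affect it. The variable names `code_points`, `recomposed`, and `stripped` are only illustrative, and the last step (dropping combining marks) is just one possible way to end up with a plain `'A'`, not necessarily what the linked answer does.

```python
import unicodedata

# Assumption: the original 'Ä' is the single precomposed code point U+00C4.
result = unicodedata.normalize('NFKD', '\u00c4')

# List every code point in the result together with its Unicode name.
code_points = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in result]
print(code_points)

# NFC recomposes the base letter and the combining mark into one code point.
recomposed = unicodedata.normalize('NFC', result)
print(len(recomposed))

# Keep only characters whose canonical combining class is 0 (non-combining).
stripped = ''.join(ch for ch in result if not unicodedata.combining(ch))
print(stripped == 'A')
```

If the terminal cannot render the combining diaeresis, `print(result)` may show only the base letter, which would be consistent with the `'A'` output in the question while `len(result)` is still 2.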
