
In Japanese, there are the dakuten and handakuten diacritics for voiced and semi-voiced syllables; see https://en.wikipedia.org/wiki/Dakuten_and_handakuten

There are at least two dakuten characters and two handakuten characters, as listed on https://en.wikipedia.org/wiki/List_of_Japanese_typographic_symbols

>>> standalone_dakuten = "\u309B"
>>> combining_dakuten = "\u3099"
>>> combining_dakuten == standalone_dakuten
False
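
A quick way to confirm these really are distinct code points is to look at their Unicode names with the standard-library unicodedata module (the "COMBINING" marks are the ones that attach to a preceding letter):

>>> import unicodedata
>>> unicodedata.name("\u3099")  # combining dakuten
'COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK'
>>> unicodedata.name("\u309B")  # standalone dakuten
'KATAKANA-HIRAGANA VOICED SOUND MARK'
>>> unicodedata.name("\u309A")  # combining handakuten
'COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK'
>>> unicodedata.name("\u309C")  # standalone handakuten
'KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK'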

If we compare a character that already has the dakuten built in, e.g. "が", with the non-voiced syllable followed by the combining dakuten, i.e. "か" + "\u3099", they are not equal, even though they look identical when printed. In code:

>>> print("が") # char with dakuten implicitly.
が
>>> print("か" + "\u3099") # char with combining dakuten.
が
>>> "が" == "か" + "\u3099"
False
>>> ord("が")
12364
>>> ord("か" + "\u3099")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
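
The ord() error hints at what is going on: the precomposed "が" is a single code point, while "か" + "\u3099" is a base letter plus a combining mark, i.e. two code points. Iterating over the strings makes this explicit:

>>> [hex(ord(c)) for c in "が"]             # precomposed character: one code point
['0x304c']
>>> [hex(ord(c)) for c in "か" + "\u3099"]  # base letter + combining mark: two code points
['0x304b', '0x3099']
>>> len("が"), len("か" + "\u3099")
(1, 2)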

My question is whether there's a function that maps the combining-dakuten sequence "か" + "\u3099" to the precomposed dakuten character "が"?

alvas
  • Take a look at [Unicode normalization](https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize) - see [this question](https://stackoverflow.com/questions/16467479/normalizing-unicode) for an example. Using `NFC`, for example, I can compare `が` and `"か" + "\u3099"` successfully. – andrewJames May 16 '20 at 19:26
  • Cool! Thanks @andrewjames, didn't know unicodedata normalize works for Japanese diacritics too!! `unicodedata.normalize('NFC', "か"+"\u3099") == "が"` -> True =) – alvas May 16 '20 at 19:45
  • Glad that worked! I will flag this as a duplicate of [Normalizing Unicode](https://stackoverflow.com/questions/16467479/normalizing-unicode) - for SO protocol. – andrewJames May 16 '20 at 21:44
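
For completeness, a minimal sketch of the normalization approach suggested in the comments, using only the standard-library unicodedata module: NFC composes the base letter plus combining mark into the precomposed character, and NFD decomposes it the other way.

>>> import unicodedata
>>> composed = unicodedata.normalize("NFC", "か" + "\u3099")   # compose into the precomposed form
>>> composed == "が"
True
>>> ord(composed)
12364
>>> unicodedata.normalize("NFD", "が") == "か" + "\u3099"      # decompose back into base + combining mark
True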

0 Answers