
In Japanese, there are the dakuten and handakuten diacritics for voiced and semi-voiced syllables; see https://en.wikipedia.org/wiki/Dakuten_and_handakuten

There are at least two dakuten characters and two handakuten characters, as listed on https://en.wikipedia.org/wiki/List_of_Japanese_typographic_symbols

>>> standalone_dakuten = "\u309B"
>>> combining_dakuten = "\u3099"
>>> combining_dakuten == standalone_dakuten
False
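
A quick way to confirm these really are distinct code points is to look at their Unicode names with the standard-library unicodedata module (the "COMBINING" marks are the ones that attach to a preceding letter):

>>> import unicodedata
>>> unicodedata.name("\u3099")  # combining dakuten
'COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK'
>>> unicodedata.name("\u309B")  # standalone dakuten
'KATAKANA-HIRAGANA VOICED SOUND MARK'
>>> unicodedata.name("\u309A")  # combining handakuten
'COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK'
>>> unicodedata.name("\u309C")  # standalone handakuten
'KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK'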

If we compare a character that already has the dakuten built in, e.g. "が", with the non-voiced syllable followed by the combining dakuten, i.e. "か" + "\u3099", they are not equal, even though they look identical when printed. In code:

>>> print("が") # char with dakuten implicitly.
が
>>> print("か" + "\u3099") # char with combining dakuten.
が
>>> "が" == "か" + "\u3099"
False
>>> ord("が")
12364
>>> ord("か" + "\u3099")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
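
The ord() error hints at what is going on: the precomposed "が" is a single code point, while "か" + "\u3099" is a base letter plus a combining mark, i.e. two code points. Iterating over the strings makes this explicit:

>>> [hex(ord(c)) for c in "が"]             # precomposed character: one code point
['0x304c']
>>> [hex(ord(c)) for c in "か" + "\u3099"]  # base letter + combining mark: two code points
['0x304b', '0x3099']
>>> len("が"), len("か" + "\u3099")
(1, 2)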

My question is whether there's a function that maps the combining-dakuten sequence "か" + "\u3099" to the precomposed dakuten character "が"?

alvas
  • Take a look at [Unicode normalization](https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize) - see [this question](https://stackoverflow.com/questions/16467479/normalizing-unicode) for an example. Using `NFC`, for example, I can compare `が` and `"か" + "\u3099"` successfully. – andrewJames May 16 '20 at 19:26
  • Cool! Thanks @andrewjames, didn't know unicodedata normalize works for Japanese diacritics too!! `unicodedata.normalize('NFC', "か"+"\u3099") == "が"` -> True =) – alvas May 16 '20 at 19:45
  • Glad that worked! I will flag this as a duplicate of [Normalizing Unicode](https://stackoverflow.com/questions/16467479/normalizing-unicode) - for SO protocol. – andrewJames May 16 '20 at 21:44
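
For completeness, a minimal sketch of the normalization approach suggested in the comments, using only the standard-library unicodedata module: NFC composes the base letter plus combining mark into the precomposed character, and NFD decomposes it the other way.

>>> import unicodedata
>>> composed = unicodedata.normalize("NFC", "か" + "\u3099")   # compose into the precomposed form
>>> composed == "が"
True
>>> ord(composed)
12364
>>> unicodedata.normalize("NFD", "が") == "か" + "\u3099"      # decompose back into base + combining mark
True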

0 Answers