I'm working on string manipulation in Python (v3.3), and I'm wondering whether there's a predictable way to detect the addition of diacritical marks to a given character.
So, for instance, is there some relationship between 'α' (ord('α') = 945, Greek unmarked alpha) and 'ᾶ' (ord('ᾶ') = 8118, Greek alpha with a circumflex), or between 'ω' (ord('ω') = 969, Greek unmarked omega) and 'ῶ' (ord('ῶ') = 8182, Greek omega with a circumflex)?
Are there any manipulations that can be done to clear the diacritics? Or to add a diacritic, for example when marking a long vowel: 'ᾱ' (ord('ᾱ') = 8113)?
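For concreteness, this is the sort of numeric comparison I've been making in the interpreter (just a sketch of my own poking around):

    import unicodedata

    # Print the code point and official Unicode name for each character.
    for ch in 'α', 'ᾶ', 'ᾱ', 'ω', 'ῶ':
        print('U+{:04X} ({:4d}) {}'.format(ord(ch), ord(ch), unicodedata.name(ch)))

    # The code-point gaps aren't uniform, so if there is a relationship,
    # it doesn't look like a simple numeric offset:
    print(ord('ᾶ') - ord('α'))  # 7173
    print(ord('ῶ') - ord('ω'))  # 7213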
Thanks!
Edit: I've played around with both the unidecode package and unicodedata. I'm not looking simply to normalize strings; I'm interested in resources for understanding the byte manipulations that happen behind the scenes to add, say, a circumflex or a macron to a standard alpha. Another way of asking the question: how does chr(945) ('α') become, or relate to, chr(8113) ('ᾱ') at a very low level? Maybe I'm thinking about this (text) in completely the wrong way, and I'd be interested in learning that too.
This question doesn't actually have so much to do with Python as it does with text encoding in general, but I mention Python just in case any of its peculiarities come into play.
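To show what I mean by "a very low level", here's the kind of comparison I've been making between those two characters, using unicodedata and the raw UTF-8 bytes (a minimal sketch of my experiments, not a solution):

    import unicodedata

    plain  = chr(945)   # 'α', GREEK SMALL LETTER ALPHA
    macron = chr(8113)  # 'ᾱ', GREEK SMALL LETTER ALPHA WITH MACRON

    # The precomposed character canonically decomposes into the plain alpha
    # plus a combining mark (U+0304, COMBINING MACRON).
    print(unicodedata.decomposition(macron))                          # 03B1 0304
    print([hex(ord(c)) for c in unicodedata.normalize('NFD', macron)])

    # NFC composition puts the pieces back into the single precomposed code point.
    print(unicodedata.normalize('NFC', plain + '\u0304') == macron)   # True

    # And these are the raw UTF-8 encodings of the two characters.
    print(plain.encode('utf-8'))   # b'\xce\xb1'
    print(macron.encode('utf-8'))  # b'\xe1\xbe\xb1'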
Edit 2: I should also add that I'm more interested in how something like unidecode works than in actually using it at the moment. unidecode('ῶ') and unidecode('ὄ') (that's an omicron, not a Latin 'o') both return 'o', and that return value is less helpful to me right now than a higher-level understanding of how the unidecode module arrives at it.
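For comparison, here is the difference between what unidecode gives me and what a simple strip-the-combining-marks pass with unicodedata gives me, just to make the two return values I'm contrasting concrete (again, a rough sketch of my experiments):

    import unicodedata
    from unidecode import unidecode

    def strip_marks(text):
        # Decompose to NFD, then drop the combining marks (category 'Mn').
        decomposed = unicodedata.normalize('NFD', text)
        return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

    print(unidecode('ῶ'), unidecode('ὄ'))      # o o  (ASCII transliteration)
    print(strip_marks('ῶ'), strip_marks('ὄ'))  # ω ο  (marks removed, Greek letters kept)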