I am working with Russian words written in the Cyrillic orthography. Everything is working fine, except that many (but not all) of the Cyrillic characters are encoded as two characters when in a str. For instance:
>>> print ["ё"]
['\xd1\x91']
This wouldn't be a problem if I didn't want to index string positions, or find where a character is and replace it with another (say "e", without the diaeresis). Obviously, the 2 "characters" are treated as one when the literal is prefixed with u, as in u"ё":
>>> print [u"ё"]
[u'\u0451']
But the strs are being passed around as variables, and so can't be prefixed with u, and unicode() gives a UnicodeDecodeError (ascii codec can't decode...).
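To make the problem concrete, here is a minimal sketch of what I am seeing. The variable names are only for illustration, and the byte values are the ones from the output above. It is written so that it should run unchanged on Python 2.7 and on Python 3.

```python
# -*- coding: utf-8 -*-
# The two bytes the interpreter shows for "ё" (copied from the output above).
raw = b"\xd1\x91"

# The byte string has length 2, so indexing and slicing see two separate items.
byte_length = len(raw)

# The same letter as a single code point, as shown for u"ё" above.
text = u"\u0451"
char_length = len(text)

# Decoding with the ascii codec (which unicode() uses by default in Python 2.7)
# fails, because both bytes are outside the 0-127 range.
try:
    raw.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(byte_length, char_length, decode_failed)
```

So the value held in the variable is two items long, while the u-prefixed literal is one, and the default conversion between them is what raises the error.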
So... how do I get around this? If it helps, I am using Python 2.7.