Unicode (Cyrillic) character indexing, re-writing in python

Question

I am working with Russian words written in the Cyrillic orthography. Everything is working fine except for how many (but not all) of the Cyrillic characters are encoded as two characters when in an str. For instance:

>>> print ["ё"]
['\xd1\x91']

This wouldn't be a problem if I didn't want to index string positions or identify where a character is and replace it with another (say "e", without the diaeresis). Obviously, the 2 "characters" are treated as one when prefixed with u, as in u"ё":

>>> print [u"ё"]
[u'\u0451']

But the strs are being passed around as variables, and so can't be prefixed with u, and unicode() gives a UnicodeDecodeError (ascii codec can't decode...).

So... how do I get around this? If it helps, I am using python 2.7

They can be converted with str.format or by passing the correct encoding to unicode() – Padraic Cunningham, Aug 04 '15 at 21:29

score 2 · Accepted Answer · answered Aug 04 '15 at 21:28

There are two possible situations here.

Either your str represents valid UTF-8 encoded data, or it does not.

If it represents valid UTF-8 data, you can convert it to a Unicode object by using mystring.decode('utf-8'). After it's a unicode instance, it will be indexed by character instead of by byte, as you have already noticed.
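
A minimal sketch of that round trip, assuming the byte string really is valid UTF-8. The variable names and the sample word are illustrative only; the b'' and u'' prefixes are spelled out so the snippet behaves the same on Python 2.7 and Python 3.

```python
# Byte string as it might arrive in a variable: UTF-8 for a four-letter
# Russian word whose first letter is U+0451 (io, "e" with diaeresis).
raw = b'\xd1\x91\xd0\xbb\xd0\xba\xd0\xb0'

# Decode once at the boundary; the result is indexed by code point.
text = raw.decode('utf-8')

# Indexing and searching now work per character, not per byte.
first = text[0]
position = text.find(u'\u0451')

# Replace U+0451 with plain Cyrillic U+0435, then encode back to bytes.
replaced = text.replace(u'\u0451', u'\u0435')
encoded = replaced.encode('utf-8')
```

Here len(raw) is 8 while len(text) is 4, which is the difference the question describes. Encoding back must use 'utf-8' (or another codec that covers Cyrillic), not 'ascii'.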

If it has invalid byte sequences in it... You're in trouble. This is because the question of "which character does this byte represent?" no longer has a clear answer. You're going to have to decide exactly what you mean when you say "the third character" in the presence of byte sequences that don't actually represent a particular Unicode character in UTF-8 at all...

Perhaps the easiest way to work around the issue is to pass errors='ignore' to decode(). This discards invalid byte sequences entirely and gives you only the "correct" portions of the string.
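
As a rough illustration, in Python the error-handling choice is the errors argument of decode(). The sample bytes below are made up: a valid UTF-8 sequence with one stray 0xff byte inserted.

```python
# Two valid Cyrillic letters with an invalid byte (0xff) between them.
damaged = b'\xd1\x91\xff\xd0\xbb'

# errors='ignore' silently drops the undecodable byte.
ignored = damaged.decode('utf-8', errors='ignore')

# errors='replace' substitutes U+FFFD so the damage stays visible.
marked = damaged.decode('utf-8', errors='replace')

# The default, errors='strict', raises instead of guessing.
try:
    damaged.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Note that 'ignore' shifts the index of every later character, so 'replace' may be the safer choice when positions matter.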

And then if I want to move it back to the original two-byte format, I use `mystring.encode('ascii')`? — sautedman, Aug 06 '15 at 18:09
@sautedman It's not a two-byte format: UTF-8 is a variable-length encoding. But yes, you can call encode() to get bytes back; use mystring.encode('utf-8'), since 'ascii' cannot represent Cyrillic characters. – Borealid, Aug 06 '15 at 21:07

score 1 · Answer 2 · answered Aug 04 '15 at 21:28

These are actually different types: the first is a byte string (str) holding UTF-8 bytes, the second is a unicode object:

>>> print ["ё"]
['\xd1\x91']
>>> print [u"ё"]
[u'\u0451']

What you're seeing is the __repr__ of each list element, not the __str__ version of the objects.

But the strs are being passed around as variables, and so can't be prefixed with u

You mean the data are strings, and need to be converted into the unicode type:

>>> for c in ["ё"]: print repr(c)
...
'\xd1\x91'

You need to decode the UTF-8 byte strings into unicode objects:

>>> for c in ["ё"]: print repr(unicode(c, 'utf-8'))
...
u'\u0451'

After this conversion, each Cyrillic letter is a single character and the strings behave as expected.

score 1 · Answer 3 · edited May 23 '17 at 12:06


To convert bytes into Unicode, you need to know the corresponding character encoding and call bytes.decode:

>>> b'\xd1\x91'.decode('utf-8')
u'\u0451'

The encoding depends on the data source. It can be anything e.g., if the data comes from a web page; see A good way to get the charset/encoding of an HTTP response in Python
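
To see why the encoding has to be known rather than guessed, here is a small sketch decoding the same two bytes with two codecs. The choice of cp1251 as the "wrong" codec is just an example of another encoding commonly used for Russian text.

```python
data = b'\xd1\x91'

# Decoded with the right codec: one code point, U+0451.
as_utf8 = data.decode('utf-8')

# Decoded with a different codec: no error is raised, but the result
# is two unrelated characters (U+0421 and U+2018).
as_cp1251 = data.decode('cp1251')
```

Both calls succeed, so a wrong guess will not necessarily show up as an exception.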

Don't use non-ascii characters in a bytes literal (it is explicitly forbidden in Python 3). Add from __future__ import unicode_literals to treat all "abc" literals as Unicode literals.

Note: a single user-perceived character may span several Unicode codepoints e.g.:

>>> print(u'\u0435\u0308')
ё
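
If the text may contain such combining sequences, one option is to normalize it before indexing or comparing. A minimal sketch using the standard unicodedata module:

```python
import unicodedata

# Cyrillic U+0435 followed by U+0308 (combining diaeresis): two code points.
decomposed = u'\u0435\u0308'

# NFC composes the pair into the single precomposed code point U+0451.
composed = unicodedata.normalize('NFC', decomposed)

# NFD goes the other way and splits it back into base letter + mark.
split_again = unicodedata.normalize('NFD', composed)
```

After NFC normalization, a search for u'\u0451' finds both spellings of the letter.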

answered Aug 05 '15 at 20:14

jfs


I assume in your last example, you are providing an "e" with a combining diacritic? – sautedman Aug 06 '15 at 16:52
yes, it is combining diaeresis. `unicodedata.normalize('NFC', u'\u0435\u0308') == u'\u0451' == u'ё'` – jfs Aug 06 '15 at 17:47
