How do I iterate through unicode symbols, not bytes in python?

Question

Given the accented unicode word like u'кни́га', I need to strip the acute (u'книга'), and also change the accent format to u'кни+га', where '+' represents the acute over the preceding letter.

What I do now is using a dictionary of acuted and not acuted symbols:

accented_list = [u'я́', u'и́', u'ы́', u'у́', u'э́', u'а́', u'е́', u'ю́', u'о́']
regular_list = [u'я', u'и', u'ы', u'у', u'э', u'а', u'е',  u'ю', u'о']
accent_dict = dict(zip(accented_list, regular_list))

I want to do something like this:

def changeAccentFormat(word):
  for letter in accent_dict:
    if letter in word:
      its_index = word.index(letter)
      word = word[:its_index + 1] + u'+' + word[its_index + 1:]
  return word

But of course it does not work as desired. I noticed that this code:

>>> word = u'кни́га'
>>> for letter in word:
...     print letter

gives

(Well, i didn't expect the blank symbol to appear, but nevertheless). So I wonder, what is the simplest way to produce [u'к', u'н', u'и́', u'г', u'а']? Or maybe there is some way to solve my problem without it?

`>>> len(u'кни́га') 6` Python treats them like that only — thefourtheye, Dec 26 '13 at 12:08
Define "unicode symbol". Python doesn't iterate through bytes, it iterates though code points (as long as you stay inside the BMP or use versions >= 3.3). That's a meaningful abstraction, though not always what you want. Depending on whether you intentionally left out the accent over the" г" in your desired output, you seem to either want grapheme clusters or remove combining characters. Perhaps you could describe your use case so we can help you choose? — , Dec 26 '13 at 12:16
@delnan, the acute is over the 'и́' letter and is still present in the desired output :) But I see, I will edit the question — FrauHahnhen, Dec 26 '13 at 12:23
D'oh! Well that narrows it down, the concept you should look into is grapheme clusters. — , Dec 26 '13 at 12:26
"symbol" comes from ASCII thinking. Do you want to count graphemes? http://utf8everywhere.org explains all kinds — Pavel Radzivilovsky, Dec 27 '13 at 05:26
you are already iterating over Unicode symbols. You might want to iterate over "user-perceived characters" (grapheme clusters) instead. You could use use `characters = regex.findall(u'\\X', word)`. See [Why python string cut returns 11 symbols when 12 is requested?](http://stackoverflow.com/a/23323520/4279) — jfs, May 09 '15 at 14:47

Lukas Graf · Accepted Answer · 2013-12-26T14:21:27.303

First of all, in regard to iterating over characters instead bytes, you're already doing it right - your word is an unicode object, not an encoded bytestring.

Now, for combination characters in Unicode:

For many characters containing combination characters there is a composed and decomposed form of writing it, the composed being one code point, and the decomposed a sequence of two (or more?) code points:

Composed and decomposed form of c with cedilla

See U+00E7, U+0063 and U+0327

So in Python, you could either write either form, it will get composed at display time to the same character:

>>> combining_cedilla = u'\u0327'
>>> c_with_cedilla = u'\u00e7'
>>> letter_c = u'\u0063'
>>>
>>> print c_with_cedilla
ç
>>> print letter_c + combining_cedilla
ç

In order to convert between composed and decomposed forms, you can use unicodedata.normalize():

>>> import unicodedata
>>> comp = unicodedata.normalize('NFC', letter_c + combining_cedilla)
>>> decomp = unicodedata.normalize('NFD', c_with_cedilla)
>>>
>>> print comp
ç
>>> print decomp
ç

(NFC stands for "normal form C" (composed), and NFD for "normal form D" (decomposed).

They still are different forms though - one consisting of one code point, the other of two:

>>> comp == decomp
False
>>> len(comp)
1
>>> len(decomp)
2

However, in your case, there simply does not seem to be a combined character for the lowercase и with an accent acute (there is one for и with an accent grave)

dawg · Answer 2 · 2015-05-07T20:03:43.893

3

You can produce [u'к', u'н', u'и́', u'г', u'а'] with the regex module.

Here is the word you have by each user perceived character:

>>> import regex
>>> word = u'кни́га'
>>> len(word)
6
>>> regex.findall(r'\X', word)
['к', 'н', 'и́', 'г', 'а']
>>> len(regex.findall(r'\X', word))
5

edited May 07 '15 at 20:03

answered May 07 '15 at 19:29

dawg

98,345
23
131
206

score 1 · Answer 3 · answered Dec 26 '13 at 13:51

Acutes are represented by codepoint 301, COMBINING ACUTE ACCENT, so a simple string character replacement should suffice:

>>>print u'кни́га'.replace(u'\u0301', "+")
кни+га

If you encounter accented characters that are not encoded with a combining accent, unicodedata.normalize should do the trick

How do I iterate through unicode symbols, not bytes in python?

3 Answers3