
I've been having a problem with Unicode in Python 3 and I can't seem to understand why it's happening.

symbol= "ῇ̣"
print(len(symbol))
# Output: 2

This letter comes from the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ, which contains combining diacritical marks. I want to run a statistical analysis in Python 3 and store the results in a database. The catch is that I also store each character's position (index) in the text. The database application correctly counts the symbol variable in the example as one character, whereas Python counts it as two, which throws off the entire indexing.

The project requires me to keep the diacritics, so I can't simply ignore them or do a .replace("combining diacritical mark","") on the string.

Since strings in Python 3 are Unicode by default, I'm a bit dumbfounded by this.

I have tried the base(), strip(), and strip_length() methods from greek-accentuation (https://pypi.org/project/greek-accentuation/), but that doesn't help either.

Project requirements are:

  • Detect the alphabet belonging to the character (OK)
  • Store string-positions (needed for highlighting in the database) (NotOK)
  • Be able to process multiple languages/alphabets mixed in one string. (OK)
  • Iterate over CSV-input. (OK)
  • Ignore set of predefined strings (OK)
  • Ignore set of strings that match certain conditions (OK)

This is the simplified code for this project:

# -*- coding: utf-8 -*-
import csv
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()
with open("tbltext.csv", "r", encoding="utf8") as txt:
    data = csv.reader(txt)
    for row in data:
        text = row[1]
        # Some string manipulation happens here (lowercasing everything,
        # replacing the predefined set of strings with equal-length runs of '-', ...).
        # Then I use the ad module to detect the alphabet by looping over the
        # characters; this is where it goes wrong.
        for letter in text:
            lang = ad.detect_alphabet(letter)

If I loop over the word ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ with a for loop, the result is:

>>> word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
>>> for letter in word:
...     print(letter)
...
ἐ
̣
ν
̣
τ
̣
ῇ
̣
[
α
ὐ
τ
]
ῇ

How can I make Python see letters with a combining diacritical mark as one letter instead of making it print the letter and the diacritical mark separately?

wjandrea
Clueless_captain

1 Answer

The string has a length of 2, and that is correct: it consists of two code points:

>>> import unicodedata
>>> list(hex(ord(c)) for c in symbol)
['0x1fc7', '0x323']
>>> list(unicodedata.name(c) for c in symbol)
['GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI', 'COMBINING DOT BELOW']

So you should not use len to count the characters.

You could count the characters that are non-combining, so:

>>> import unicodedata
>>> len(''.join(ch for ch in symbol if unicodedata.combining(ch) == 0))
1

From: How do I get the "visible" length of a combining Unicode string in Python? (but I ported it to Python 3).
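Since the question also needs positions and a per-letter loop, the same `unicodedata.combining` check can be used to group each base character with the combining marks that follow it. The following is only a minimal sketch: `iter_clusters` is a hypothetical helper name, and it handles combining marks only, not full grapheme-cluster segmentation as defined by Unicode. The sample text is built from escapes so that the code points are unambiguous.

```python
import unicodedata


def iter_clusters(text):
    """Yield (cluster_index, start_offset, cluster) tuples.

    A cluster is one non-combining code point plus every combining mark
    that directly follows it. start_offset is the code point offset that
    Python's own indexing would use.
    """
    cluster_index = -1
    start = 0
    current = ""
    for offset, ch in enumerate(text):
        if current and unicodedata.combining(ch) != 0:
            # Combining mark: attach it to the cluster being built.
            current += ch
            continue
        if current:
            yield cluster_index, start, current
        cluster_index += 1
        start = offset
        current = ch
    if current:
        yield cluster_index, start, current


symbol = "\u1fc7\u0323"  # the two code points from the question
text = "\u03c4\u0323" + symbol + "[\u03b1]"  # tau + dot below, symbol, "[", alpha, "]"

clusters = list(iter_clusters(text))
for cluster_index, start, cluster in clusters:
    print(cluster_index, start, cluster)
```

The cluster index is the position a cluster-counting application would presumably report, while the start offset is what Python slicing expects, so storing both keeps the two views in sync. If full grapheme clusters are needed, the third-party `regex` module's `\X` pattern is one option worth checking.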

But this is not an optimal solution either, depending on what you need the character count for. I think it is enough in your case, but fonts can merge characters into ligatures. In some languages these are visually new (and very different) characters, unlike ligatures in Western languages.

As a last comment: I think you should normalize your strings. With the code above it doesn't matter in this case, but in other cases you may get different results, especially if someone used compatibility characters (e.g. mu for units, or Eszett, instead of the true Greek characters).
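To illustrate why normalization matters, here is a small sketch using `unicodedata.normalize` from the standard library. The sample strings and the helper name `visible_length` are chosen for illustration and are not taken from the question's data.

```python
import unicodedata


def visible_length(text):
    """Count the code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# The same accented letter, written as base letter + combining mark.
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
# NFC composes it into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # code point counts differ
print(visible_length(decomposed), visible_length(composed))  # both count one letter

# Compatibility characters: MICRO SIGN looks like the Greek letter mu
# but is a different code point. NFKC folds it to GREEK SMALL LETTER MU.
micro = "\u00b5"
folded = unicodedata.normalize("NFKC", micro)
print(unicodedata.name(micro), "->", unicodedata.name(folded))
```

Whichever form is chosen, applying the same normalization before computing positions and before storing the text should keep indexes comparable between runs. Note that NFKC changes the stored text itself, which may or may not be acceptable for the project.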

Giacomo Catenazzi
  • And I'm asking for help, on how to name such function – Giacomo Catenazzi Feb 20 '19 at 14:23
  • Okay, that makes sense now. – Clueless_captain Feb 21 '19 at 09:13
  • Thanks for the explanation and suggestion. You could call it screen_length or combined_length (as this behaviour is limited to characters in the unicode block of 'Combining characters'.) I keep finding it odd that the DB and Notepad++ identify the combined characters as a single-length-unit; though I'll have to live with it. I haven't thought of using normalization for those cases, thanks for pointing that out too! – Clueless_captain Feb 21 '19 at 09:20
  • Notepad++ works visually, so you cannot place the cursor between a character and its combining character. In Python you often need to parse a string and get at the code points (e.g. `for i in range(len(my_string))`), so you need to address every single code point. It would be much more complex if one "character" could consist of hundreds of code points and could no longer be easily converted into numbers/bytes. Languages and characters are complex. Unicode has a lot of description and problems, but there is no "one size fits all". Having a single coding (Unicode) is already a great success. – Giacomo Catenazzi Feb 21 '19 at 09:26