The Python code below gives the wrong length for the string and the wrong characters when iterating.
Does anybody here have any idea why?
>>> w = 'lòng'
>>> w
'lòng'
>>> print (w)
lòng
>>> len(w)
5
>>> for ch in w:
...     print(ch + "-")
...
l-
o-
-
n-
g-
>>>
The issue here is that in Unicode some characters may be composed of combinations of other characters. In this case, 'lòng' includes a lowercase 'o' and a grave accent as separate characters.
>>> import unicodedata as ud
>>> w = 'lòng'
>>> for c in w:
...     print(ud.name(c))
...
LATIN SMALL LETTER L
LATIN SMALL LETTER O
COMBINING GRAVE ACCENT
LATIN SMALL LETTER N
LATIN SMALL LETTER G
This is a decomposed unicode string, because the accented 'o' is decomposed into two characters. The unicodedata module provides the normalize function to convert between decomposed and composed forms:
>>> for c in ud.normalize('NFC', w):
...     print(ud.name(c))
...
LATIN SMALL LETTER L
LATIN SMALL LETTER O WITH GRAVE
LATIN SMALL LETTER N
LATIN SMALL LETTER G
If you want to know whether a string is normalised to a particular form, but don't want to actually normalise it, and are using Python 3.8+, the more efficient unicodedata.is_normalized function can be used (credit to user Acumenus):
>>> ud.is_normalized('NFC', w)
False
>>> ud.is_normalized('NFD', w)
True
The Unicode HOWTO in the Python documentation includes a section on comparing strings which discusses this in more detail.
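The practical upshot when comparing strings is to normalise both sides to the same form first. A minimal sketch (nfc_equal is just an illustrative helper name, not a standard function):

```python
import unicodedata

def nfc_equal(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

composed = 'l\u00f2ng'     # 'lòng' with a precomposed ò (4 code points)
decomposed = 'lo\u0300ng'  # 'lòng' with o + combining grave accent (5 code points)

print(composed == decomposed)           # False: the code points differ
print(nfc_equal(composed, decomposed))  # True: they are the same text
```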
Unicode allows a lot of flexibility in encoding a character. In this case, the ò is actually made up of 2 Unicode code points: one for the base character o and one for the accent mark. Unicode also has a single character that represents both at once, and it doesn't care which you use. Python includes a module, unicodedata, that can provide a consistent representation.
>>> import unicodedata
>>> w = 'lòng'
>>> len(w)
5
>>> len(unicodedata.normalize('NFC', w))
4
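Going the other way, NFD normalisation decomposes precomposed characters, so the same string grows back to 5 code points. A quick sketch:

```python
import unicodedata

w = 'l\u00f2ng'  # precomposed ò: 4 code points
print(len(w))                                # 4
print(len(unicodedata.normalize('NFD', w)))  # 5: ò decomposes into o + combining grave
```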
The problem is that the len function and string iteration operate on Unicode code points, not on user-perceived characters (graphemes).
As of now, there are two answers that claim normalisation is the solution. Unfortunately, that's not true in general:
>>> w = 'Ꙝ̛͋ᄀᄀᄀ각ᆨᆨ❤️'
>>> len(w)
19
>>> import unicodedata
>>> len(unicodedata.normalize('NFC', w))
19
>>> # 19 is still wrong
To handle this task correctly, you need to operate on graphemes:
>>> from grapheme import graphemes
>>> w = 'Ꙝ̛͋ᄀᄀᄀ각ᆨᆨ❤️'
>>> len(list(graphemes(w)))
3
>>> # 3 is correct
>>> for g in graphemes(w):
...     print(g)
Ꙝ̛͋
ᄀᄀᄀ각ᆨᆨ
❤️
This also works for the original w = 'lòng' input: it correctly segments into 4 graphemes without any normalisation.
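If installing a third-party package such as grapheme is not an option, a rough standard-library approximation is to skip combining marks when counting. This handles accents like the one in 'lòng', but it does not handle Hangul conjoining jamo or emoji ZWJ sequences, so it is not a full replacement for proper grapheme segmentation:

```python
import unicodedata

def approx_grapheme_len(s):
    """Count code points that are not combining marks (rough approximation only)."""
    return sum(1 for c in s if not unicodedata.combining(c))

w = 'lo\u0300ng'  # 'lòng' with a combining grave accent
print(len(w))                  # 5 code points
print(approx_grapheme_len(w))  # 4 user-perceived characters
```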