The Python code below gives the wrong length for the string and the wrong characters when iterating.
Does anybody here have any idea why?
>>> w = 'lòng'
>>> w
'lòng'
>>> print (w)
lòng
>>> len(w)
5
>>> for ch in w:
...     print(ch + "-")
...
l-
o-
-
n-
g-
>>>
The issue here is that in Unicode some characters may be composed of combinations of other characters. In this case, 'lòng' includes a lowercase 'o' and a grave accent as separate characters.
>>> import unicodedata as ud
>>> w = 'lòng'
>>> for c in w:
...     print(ud.name(c))
...
LATIN SMALL LETTER L
LATIN SMALL LETTER O
COMBINING GRAVE ACCENT
LATIN SMALL LETTER N
LATIN SMALL LETTER G
This is a decomposed unicode string, because the accented 'o' is decomposed into two characters. The unicodedata module provides the normalize function to convert between decomposed and composed forms:
>>> for c in ud.normalize('NFC', w):
...     print(ud.name(c))
...
LATIN SMALL LETTER L
LATIN SMALL LETTER O WITH GRAVE
LATIN SMALL LETTER N
LATIN SMALL LETTER G
If you want to know whether a string is normalised to a particular form, but don't want to actually normalise it, and are using Python 3.8+, the more efficient unicodedata.is_normalized function can be used (credit to user Acumenus):
>>> ud.is_normalized('NFC', w)
False
>>> ud.is_normalized('NFD', w)
True
The Unicode HOWTO in the Python documentation includes a section on comparing strings which discusses this in more detail.
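The practical upshot when comparing strings is to normalise both sides to the same form first. A minimal sketch (nfc_equal is just an illustrative helper name, not a standard function):

```python
import unicodedata

def nfc_equal(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

composed = 'l\u00f2ng'     # 'lòng' with a precomposed ò (4 code points)
decomposed = 'lo\u0300ng'  # 'lòng' with o + combining grave accent (5 code points)

print(composed == decomposed)           # False: the code points differ
print(nfc_equal(composed, decomposed))  # True: they are the same text
```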
Unicode allows a lot of flexibility in encoding a character. In this case, the ò is actually made up of 2 Unicode code points: one for the base character o and one for the accent mark. Unicode also has a single character that represents both at once, and it doesn't care which you use. Python includes a module, unicodedata, that can provide a consistent representation.
>>> import unicodedata
>>> w = 'lòng'
>>> len(w)
5
>>> len(unicodedata.normalize('NFC', w))
4
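Going the other way, NFD normalisation decomposes precomposed characters, so the same string grows back to 5 code points. A quick sketch:

```python
import unicodedata

w = 'l\u00f2ng'  # precomposed ò: 4 code points
print(len(w))                                # 4
print(len(unicodedata.normalize('NFD', w)))  # 5: ò decomposes into o + combining grave
```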
The problem is that the len function and string iteration operate on Unicode code points, not on user-perceived characters (graphemes).
As of now, there are two answers that claim normalisation is the solution. Unfortunately, that's not true in general:
>>> w = 'Ꙝ̛͋ᄀᄀᄀ각ᆨᆨ❤️'
>>> len(w)
19
>>> import unicodedata
>>> len(unicodedata.normalize('NFC', w))
19
>>> # 19 is still wrong
To handle this task correctly, you need to operate on graphemes:
>>> from grapheme import graphemes
>>> w = 'Ꙝ̛͋ᄀᄀᄀ각ᆨᆨ❤️'
>>> len(list(graphemes(w)))
3
>>> # 3 is correct
>>> for g in graphemes(w):
...     print(g)
Ꙝ̛͋
ᄀᄀᄀ각ᆨᆨ
❤️
This also works for the original w = 'lòng' input: it correctly segments into 4 graphemes without any normalisation.
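If installing a third-party package such as grapheme is not an option, a rough standard-library approximation is to skip combining marks when counting. This handles accents like the one in 'lòng', but it does not handle Hangul conjoining jamo or emoji ZWJ sequences, so it is not a full replacement for proper grapheme segmentation:

```python
import unicodedata

def approx_grapheme_len(s):
    """Count code points that are not combining marks (rough approximation only)."""
    return sum(1 for c in s if not unicodedata.combining(c))

w = 'lo\u0300ng'  # 'lòng' with a combining grave accent
print(len(w))                  # 5 code points
print(approx_grapheme_len(w))  # 4 user-perceived characters
```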