In a Python 3 string from another program, ü is two characters, the u and the umlaut. Why?

Question

I am taking a string from a PHP 7 program and processing it in Python 3.7.2.

my_str = 'ü'

print(type(my_str))

str_list = list(my_str)

for letter in str_list:
    print('letter', letter)

if 'ü' in my_str:
    print('we have the umlaut')
else:
    print('we have no umlaut')

Here is the output:

<class 'str'>
letter u
letter ̈
we have no umlaut

Why is the letter u separate from the umlaut? If I type a ü in this string, it is read as 'ü', and the test for 'ü' succeeds. How can I correct this string, so it has a ü and not two separate characters?

Thanks in advance for any tips. I have searched for this and found nothing helpful.

Welcome to the world of non-spacing Unicode characters! It's a [COMBINING DIAERESIS](https://www.fileformat.info/info/unicode/char/0308/index.htm) — tdelaney, May 12 '20 at 17:46
You can normalize the string as in https://stackoverflow.com/questions/16467479/normalizing-unicode. `'ü' in unicodedata.normalize('NFC', my_str)` is `True`. — tdelaney, May 12 '20 at 17:48
For others bumping into Unicode issues like this: I got the character's ordinal in hex with `hex(ord(my_str[1]))`, then did an internet search on "unicode U+0308" to get details on the character. — tdelaney, May 12 '20 at 17:51
This is a great question. How many of us normalize our Unicode data before use? Hands? (I confess, I don't.) But it's a subtle source of bugs. — tdelaney, May 12 '20 at 17:54
@snakecharmerb - Good point. `unicodedata.category` is also interesting. — tdelaney, May 12 '20 at 18:30
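
The inspection steps mentioned in these comments (`hex(ord(...))`, `unicodedata.category`) can be combined into one small routine. This is only a sketch: `describe` is a hypothetical helper name, and the input is assumed to be `u` followed by U+0308, written with an escape so the two code points stay separate.

```python
import unicodedata


def describe(text):
    """Return (code point, category, combining class, name) for each character.

    `describe` is a hypothetical helper used for illustration only.
    """
    return [
        (
            f"U+{ord(ch):04X}",                 # code point in the usual U+XXXX notation
            unicodedata.category(ch),           # 'Ll' = lowercase letter, 'Mn' = non-spacing mark
            unicodedata.combining(ch),          # 0 for base characters, non-zero for combining marks
            unicodedata.name(ch, "<unnamed>"),  # official character name, with a fallback
        )
        for ch in text
    ]


# Assumed input: 'u' followed by a combining diaeresis
decomposed = "u\u0308"

for row in describe(decomposed):
    print(row)
```

A non-zero combining class, or a category starting with `M`, is a quick hint that a character attaches to the one before it rather than standing alone.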

score 2 · Accepted Answer · answered May 12 '20 at 17:56

The character in your string and the one in your condition have different representations:

from unicodedata import name, normalize


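my_str = 'u\u0308'  # 'u' followed by U+0308 COMBINING DIAERESIS, written as an escape so copy-paste cannot recompose it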
for c in my_str:
    print(name(c))

# LATIN SMALL LETTER U
# COMBINING DIAERESIS

your_u = 'ü'  # copy pasted from your 'if ...' line
for c in your_u:
    print(name(c))

# LATIN SMALL LETTER U WITH DIAERESIS

You can normalize your string:

my_normalized_str = normalize('NFC', my_str)

for c in my_normalized_str:
    print(name(c))

# LATIN SMALL LETTER U WITH DIAERESIS

And now your comparison will work as expected:

if 'ü' in my_normalized_str:
    print('we have the umlaut')
else:
    print('we have no umlaut')

# we have the umlaut
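
One way to keep this from recurring is to normalize text once, where it enters the program, and to compare only normalized values. The following is a minimal sketch assuming NFC is the form you want; `nfc` and `contains` are hypothetical helper names, not part of any library.

```python
import unicodedata


def nfc(text):
    """Return `text` in NFC (composed) form. Hypothetical helper."""
    return unicodedata.normalize("NFC", text)


def contains(haystack, needle):
    """Substring test that ignores composed/decomposed differences."""
    return nfc(needle) in nfc(haystack)


decomposed = "u\u0308"  # 'u' + COMBINING DIAERESIS, as received from the other program
composed = "\u00fc"     # LATIN SMALL LETTER U WITH DIAERESIS, as typed in the source

print(decomposed == composed)                      # False: different code point sequences
print(len(decomposed), len(composed))              # 2 1
print(nfc(decomposed) == composed)                 # True
print(contains("f" + decomposed + "r", composed))  # True
```

NFD goes the other way, splitting composed characters into a base letter plus combining marks. Either form works for comparisons as long as both sides use the same one.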

Whoa. This is the answer! Thank you, @Thierry Lathuille, and commenters tdelaney and snakecharmerb. My world just got bigger. :-) — excyberlabber, May 12 '20 at 21:53
