
I am taking a string from a PHP 7 program and processing it in Python 3.7.2.

my_str = 'ü'

print(type(my_str))

str_list = list(my_str)

for letter in str_list:
    print('letter',letter)

if 'ü' in my_str:
    print('we have the umlaut')
else:
    print('we have no umlaut')

Here is the output:

<class 'str'>
letter u
letter ̈
we have no umlaut

Why is the letter u separate from the umlaut? If I type a ü in this string, it is read as 'ü', and the test for 'ü' succeeds. How can I correct this string, so it has a ü and not two separate characters?

Thanks in advance for any tips. I have searched for this and found nothing helpful.

excyberlabber
  • Welcome to the world of non-spacing Unicode characters! It's a [COMBINING DIAERESIS](https://www.fileformat.info/info/unicode/char/0308/index.htm) – tdelaney May 12 '20 at 17:46
  • You can normalize the string as in https://stackoverflow.com/questions/16467479/normalizing-unicode. `'ü' in unicodedata.normalize('NFC', my_str)` is `True`. – tdelaney May 12 '20 at 17:48
  • For others bumping into Unicode issues like this: I got the character's ordinal in hex with `hex(ord(my_str[1]))`, then searched the internet for "unicode U+0308" to get details on the character. – tdelaney May 12 '20 at 17:51
  • This is a great question. How many of us normalize our Unicode data before use? Hands? (I confess, I don't.) But it's a subtle source of bugs. – tdelaney May 12 '20 at 17:54
  • @tdelaney There is also `unicodedata.name(my_str[1])` – snakecharmerb May 12 '20 at 18:08
  • @snakecharmerb - Good point. `unicodedata.category` is also interesting. – tdelaney May 12 '20 at 18:30
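
The inspection steps mentioned in these comments (`ord`, `unicodedata.name`, `unicodedata.category`) can be combined into one small helper. This is only a sketch: the function name `describe` is made up for illustration, and the string is written with escape sequences so the code points are unambiguous regardless of how an editor or browser renders them.

```python
import unicodedata

# Decomposed form, spelled out with escapes:
# U+0075 (u) followed by U+0308 (combining diaeresis).
my_str = "\u0075\u0308"


def describe(text):
    """Return (code point, Unicode name, general category) for each character.

    Hypothetical helper, not part of the standard library.
    """
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),  # default avoids ValueError
            unicodedata.category(ch),
        )
        for ch in text
    ]


for row in describe(my_str):
    print(row)

# ('U+0075', 'LATIN SMALL LETTER U', 'Ll')
# ('U+0308', 'COMBINING DIAERESIS', 'Mn')
```

The category `Mn` (Mark, nonspacing) is what marks the second character as a combining mark that attaches to the preceding letter.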

1 Answer

The character in your string and the one in your condition have different representations:

from unicodedata import name, normalize


my_str = 'ü'
for c in my_str:
    print(name(c))

# LATIN SMALL LETTER U
# COMBINING DIAERESIS

your_u = 'ü'  # copy pasted from your 'if ...' line
for c in your_u:
    print(name(c))

# LATIN SMALL LETTER U WITH DIAERESIS

You can normalize your string:

my_normalized_str = normalize('NFC', my_str)

for c in my_normalized_str:
    print(name(c))

# LATIN SMALL LETTER U WITH DIAERESIS
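
To make the difference between the two representations explicit, here is a short sketch that builds both forms from escape sequences and converts between them. The variable names `decomposed` and `precomposed` are just illustrative; NFC composes the pair into one code point, and NFD goes the other way.

```python
from unicodedata import normalize

decomposed = "u\u0308"   # u + COMBINING DIAERESIS: two code points
precomposed = "\u00fc"   # LATIN SMALL LETTER U WITH DIAERESIS: one code point

# The strings look the same when rendered but are not equal.
print(decomposed == precomposed)                    # False
print(len(decomposed), len(precomposed))            # 2 1

# NFC composes, NFD decomposes.
print(normalize("NFC", decomposed) == precomposed)  # True
print(normalize("NFD", precomposed) == decomposed)  # True
```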

And now your comparison will work as expected:

if 'ü' in my_normalized_str:
    print('we have the umlaut')
else:
    print('we have no umlaut')

# we have the umlaut
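
If you cannot be sure which form either side of the comparison is in, one option is to normalize both before testing. The helper below is a minimal sketch under that assumption; `nfc_contains` and the sample value `text_from_php` are hypothetical names, not something from your program.

```python
from unicodedata import normalize


def nfc_contains(haystack, needle):
    """Substring test after bringing both strings to NFC.

    Hypothetical helper for illustration.
    """
    return normalize("NFC", needle) in normalize("NFC", haystack)


# Made-up sample input containing the decomposed form (u + U+0308).
text_from_php = "Mu\u0308nchen"

print("\u00fc" in text_from_php)              # False: raw comparison fails
print(nfc_contains(text_from_php, "\u00fc"))  # True: both sides normalized
```

Normalizing once, at the point where the string enters the Python program, is usually simpler than normalizing at every comparison.
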
Thierry Lathuille
  • Whoa. This is the answer! Thank you, @Thierry Lathuille, and commenters tdelaney and snakecharmerb. My world just got bigger. :-) – excyberlabber May 12 '20 at 21:53