Python: lower() method generates wrong letter in a string

Question

text = 'ÇEKİM GÜNÜ KALİTESİNİ DÜZENLERLSE'

sentence = text.split(' ')
print(sentence)

if "ÇEKİM" in sentence:
    print("yes-1")

print(" ")
sentence_ = text.lower().split(' ')
print(sentence_)
   
if "çekim" in sentence_:
    print("yes-2")

>> output: 

['ÇEKİM', 'GÜNÜ', 'KALİTESİNİ', 'DÜZENLERLSE']
yes-1
 
['çeki̇m', 'günü', 'kali̇tesi̇ni̇', 'düzenlerlse']

I have a problem with a string. I split a sentence into a list of words and check whether a specific word is in that list. Searching for "ÇEKİM" in the original list works (it prints yes-1). However, when I lowercase the sentence first and search for "çekim", the word is not found, because the "i" letter comes out differently. What is the reason for this (encoding/decoding)? Why does the lower() method change the string beyond lowercasing it? By the way, it is a Turkish word. Upper: ÇEKİM, lower: çekim

You used an ASCII lower case `i` in the literal string `"çekim"`, but `'İ'.lower()` does not give just the ASCII lower case `i`. It gives 'i' followed by [Unicode Character 'COMBINING DOT ABOVE' (U+0307)](https://www.fileformat.info/info/unicode/char/0307/index.htm). — Warren Weckesser, Dec 25 '20 at 17:53
Related: https://www.unicode.org/Public/13.0.0/ucd/SpecialCasing.txt and https://bugs.python.org/issue34723 — JosefZ, Dec 25 '20 at 18:25

Mark Tolonen · Accepted Answer · 2020-12-25T18:33:12.317

Turkish i and English i are treated differently. The capital of the Turkish dotted i is İ (U+0130), while the capital of the English i is I. To keep the two apart, Unicode has special rules for converting to lower and upper case. Lowercasing İ does not give a plain i: it gives an i followed by a combining mark (COMBINING DOT ABOVE, U+0307). Converting that lowercase version back to upper case also leaves the characters in a decomposed form (I plus the combining mark), so a proper comparison needs to normalize the strings to a standard form first. A decomposed form does not compare equal to a composed form. Note the differences in the strings below:

#coding:utf8
import unicodedata as ud

def dump_names(s):
    print('string:',s)
    for c in s:
        print(f'  U+{ord(c):04X} {ud.name(c)}')  # indent each code point under its string
    
turkish_i = 'İ'
dump_names(turkish_i)
dump_names(turkish_i.lower())
dump_names(turkish_i.lower().upper())
dump_names(ud.normalize('NFC',turkish_i.lower().upper()))

string: İ
  U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
string: i̇
  U+0069 LATIN SMALL LETTER I
  U+0307 COMBINING DOT ABOVE
string: İ
  U+0049 LATIN CAPITAL LETTER I
  U+0307 COMBINING DOT ABOVE
string: İ
  U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
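
The point about normalizing before comparing can be sketched as a small helper. The name nfc_equal is hypothetical, and NFC is only one reasonable choice of normal form; this is an illustration, not part of the code above.

```python
import unicodedata as ud

def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return ud.normalize('NFC', a) == ud.normalize('NFC', b)

# Composed form: a single code point, U+0130.
composed = '\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}'
# Decomposed form: U+0049 followed by U+0307, as produced by lower() then upper().
decomposed = 'I\N{COMBINING DOT ABOVE}'

print(composed == decomposed)           # False: different code point sequences
print(nfc_equal(composed, decomposed))  # True: equal once both are in NFC
```

Both strings render the same way, which is why the raw comparison failing can be surprising.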

Some terminals also have display issues. My system displays 'çeki̇m' with the dot over the m, not the i. For example, in the Chrome browser the following displays correctly:

>>> s = 'ÇEKİM'
>>> s.lower()
'çeki̇m'

But on one of my editors it displays as:

Image of editor with dot over m

So it appears something like this is what the OP is seeing. The following comparison will work:

if "çeki\N{COMBINING DOT ABOVE}m" in sentence_:
    print("yes-2")
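
If the text is known to be Turkish, an alternative is to map the two capital I letters explicitly before lowercasing, so that the result contains a plain i and the original test for "çekim" succeeds. This is a minimal sketch under that assumption: turkish_lower is a hypothetical helper, not a standard library function, and it would wrongly turn an English I into a dotless ı in mixed-language text.

```python
import unicodedata as ud

# Explicit mapping for the two Turkish capital I letters (assumes Turkish-only text).
TURKISH_UPPER_I = str.maketrans({
    '\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}': 'i',   # dotted capital -> plain i
    'I': '\N{LATIN SMALL LETTER DOTLESS I}',            # dotless capital -> dotless small
})

def turkish_lower(s):
    """Lowercase s using Turkish rules for the I letters (hypothetical helper)."""
    # Normalize first so a decomposed I + U+0307 is handled like the composed capital.
    s = ud.normalize('NFC', s)
    return s.translate(TURKISH_UPPER_I).lower()

text = 'ÇEKİM GÜNÜ KALİTESİNİ DÜZENLERLSE'
sentence_ = turkish_lower(text).split(' ')
print(sentence_)   # ['çekim', 'günü', 'kalitesini', 'düzenlerlse']

if "çekim" in sentence_:
    print("yes-2")
```

With this helper the lowercased list contains no combining marks, so the plain literal "çekim" matches and yes-2 is printed.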

Yes. I tested and it is correct. Can one solution be to transform all special characters like "İ" to the English language format "I" ? or do you have any alternative? — Mehmet Kazanç, Dec 25 '20 at 18:35
@Mehmet I'm no expert in Turkish, but misspelling the word doesn't seem like the best solution. Turkish has a dotless upper/lower (I/ı) as well that has similar problems. — Mark Tolonen, Dec 25 '20 at 18:41
@Mehmet See also https://en.wikipedia.org/wiki/Dotted_and_dotless_I and [What is the Turkey Test](https://stackoverflow.com/q/796986/235698). — Mark Tolonen, Dec 25 '20 at 18:45
