It's not corrupted.
Turkish has both a dotted lowercase i and a dotless lowercase ı, and similarly a dotted uppercase İ and a dotless uppercase I.
This presents a challenge when converting the dotted uppercase İ to lowercase: how to retain the information that, if it needs to be converted back to uppercase, it should be converted back to the dotted İ?
Unicode solves this problem as follows: when İ is converted to lowercase, it's actually converted to the standard Latin i plus the combining character U+0307 "COMBINING DOT ABOVE". What you're seeing is your terminal's inability to properly render (or, more to the point, refrain from rendering) the combining character; it has nothing to do with Python.
You can see that this is happening using unicodedata.name():
>>> import unicodedata
>>> [unicodedata.name(c) for c in 'İ']
['LATIN CAPITAL LETTER I WITH DOT ABOVE']
>>> [unicodedata.name(c) for c in 'İ'.lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
... although, in a working and correctly configured terminal, it will render without any problems:
>>> 'İ'.lower()
'i̇'
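One practical consequence is that the lowercased string is two code points long rather than one, even when it renders as a single glyph. A minimal sketch to check this (the variable names are just for illustration, and the character is written as an escape so the example does not depend on the source file's encoding):

```python
# U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE.
original = '\u0130'
lowered = original.lower()

# The lowercase form is 'i' followed by the combining dot above.
codepoints = [f'U+{ord(c):04X}' for c in lowered]

print(len(original), len(lowered))  # 1 2
print(codepoints)                   # ['U+0069', 'U+0307']
```

This matters if you rely on len() or on indexing into the string after lowercasing.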
As a side note, if you do convert it back to uppercase, it will remain in the decomposed form:
>>> [unicodedata.name(c) for c in 'İ'.lower().upper()]
['LATIN CAPITAL LETTER I', 'COMBINING DOT ABOVE']
... although you can recombine it with unicodedata.normalize():
>>> [unicodedata.name(c) for c in unicodedata.normalize('NFC','İ'.lower().upper())]
['LATIN CAPITAL LETTER I WITH DOT ABOVE']
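If you need the round trip to give back the original single character, one option is to wrap the uppercase step and the NFC normalization together. This is only a sketch; upper_nfc is a hypothetical helper name, not something provided by Python or unicodedata:

```python
import unicodedata

def upper_nfc(text):
    """Uppercase text, then recompose it to NFC (hypothetical helper)."""
    return unicodedata.normalize('NFC', text.upper())

original = '\u0130'          # LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = original.lower()   # 'i' + U+0307

plain_upper = lowered.upper()    # 'I' + U+0307, still decomposed
recomposed = upper_nfc(lowered)  # single code point U+0130 again

print(plain_upper == original)   # False
print(recomposed == original)    # True
```

Note that NFC normalization affects the whole string, so any other decomposed sequences in the text will be recomposed as well.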
For more information, see: