
I am trying to convert some words that contain Turkish characters to lowercase.

Reading words from a file which is utf-8 encoded:

with open(filepath, 'r', encoding='utf8') as f:
    text = f.read().lower()

When I try to convert to lowercase, the Turkish character İ gets corrupted. However, when I convert to uppercase, it works fine.

Here is example code:

word = 'İşbirliği'  # renamed from `str` to avoid shadowing the built-in type
print(word)
print(word.lower())

Here is how it looks when it is corrupted:

[screenshot: console output in which the lowercased İ is not rendered correctly]

What's going on here?

Some info that might be useful:

  • I'm using Windows 10 cmd prompt
  • Python version 3.6.0
  • chcp is set to 65001
Mark Tolonen
moth
  • Post actual text, not image links; otherwise it is difficult for us to copy and paste it to try ourselves. – Mark Tolonen Mar 19 '17 at 14:12
  • Please see [Why may I not upload images of code on SO when asking a question?](http://meta.stackoverflow.com/questions/285551/why-may-i-not-upload-images-of-code-on-so-when-asking-a-question) – PM 2Ring Mar 19 '17 at 14:17
  • @MarkTolonen in this case, an image is the only reasonable way to demonstrate the problem. – Zero Piraeus Mar 19 '17 at 14:47
  • @zero Ah, OK. It would be nice to also have the text pasted from the console as well, even though it could get transformed a little by the time we see it in our browsers. – PM 2Ring Mar 19 '17 at 15:10
  • @Zero An image may also be needed, but the actual original text is still helpful for those without a Turkish keyboard. – Mark Tolonen Mar 19 '17 at 15:10
  • I tried to copy the text from the console, but it gets transformed when I paste it into the posting screen. – moth Mar 19 '17 at 16:08

1 Answer


It's not corrupted.

Turkish has both a dotted lowercase i and a dotless lowercase ı, and similarly a dotted uppercase İ and a dotless uppercase I.

This presents a challenge when converting the dotted uppercase İ to lowercase: how to retain the information that, if it needs to be converted back to uppercase, it should be converted back to the dotted İ?

Unicode solves this problem as follows: when İ is converted to lowercase, it's actually converted to the standard Latin i plus the combining character U+0307 "COMBINING DOT ABOVE". What you're seeing is your terminal's inability to properly render (or, more to the point, refrain from rendering) the combining character; it has nothing to do with Python.

You can see that this is happening using unicodedata.name():

>>> import unicodedata
>>> [unicodedata.name(c) for c in 'İ']
['LATIN CAPITAL LETTER I WITH DOT ABOVE']
>>> [unicodedata.name(c) for c in 'İ'.lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']

... although, in a working and correctly configured terminal, it will render without any problems:

>>> 'İ'.lower()
'i̇'

As a side note, if you do convert it back to uppercase, it will remain in the decomposed form:

>>> [unicodedata.name(c) for c in 'İ'.lower().upper()]
['LATIN CAPITAL LETTER I', 'COMBINING DOT ABOVE']

… although you can recombine it with unicodedata.normalize():

>>> [unicodedata.name(c) for c in unicodedata.normalize('NFC','İ'.lower().upper())]
['LATIN CAPITAL LETTER I WITH DOT ABOVE']
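If what you want is Turkish-style lowercasing (İ to i, I to ı) rather than a reversible round trip, one possible approach is to map those two letters yourself before calling lower(). This is only a minimal sketch: turkish_lower is a hypothetical helper, not something the standard library provides, and it assumes the entire input should follow Turkish casing rules.

```python
import unicodedata

# Hypothetical helper (not part of the standard library): lowercases text
# using the Turkish rules for dotted and dotless I.
def turkish_lower(text):
    # Compose 'I' + U+0307 into the single character U+0130 first, so a
    # decomposed dotted capital is handled like the precomposed one.
    text = unicodedata.normalize('NFC', text)
    # Map the two letters whose Turkish lowercase differs from the default,
    # then let str.lower() deal with everything else.
    text = text.replace('\u0130', 'i')   # İ (dotted capital)  -> i
    text = text.replace('I', '\u0131')   # I (dotless capital) -> ı
    return text.lower()

word = '\u0130\u015fbirli\u011fi'        # 'İşbirliği'
lowered = turkish_lower(word)
print(lowered)
print([unicodedata.name(c) for c in lowered[:2]])
```

Because the dotted İ is replaced before lower() runs, no combining dot is produced. The trade-off is that text mixing Turkish with other languages would have every capital I turned into ı.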

Zero Piraeus
  • Thanks for the clear information in your post. If it were only a rendering issue I would not mind, but I am using a regular expression that splits text into words, and the extra character in that word causes a wrong split: `re.split('\W+','İşbirliği önemlidir')` returns `['İşbirliği', 'önemlidir']`, while `re.split('\W+','İşbirliği önemlidir'.lower())` returns `['i', 'şbirliği', 'önemlidir']`. – moth Mar 19 '17 at 16:22
  • I would call this a bug in Python's `re.split()`. A joining character should not be included in `\W` although the semantics are a bit complex (what if it was joined to a non-word character? Can you even do that?) – tripleee Mar 19 '17 at 17:03
  • I agree with @tripleee that this is a bug in `re`. The more modern [regex](https://pypi.python.org/pypi/regex/) treats combining characters as word characters: `regex.split('\W+','İşbirliği önemlidir'.lower())` returns `['i̇şbirliği', 'önemlidir']` as expected, in case that helps. – Zero Piraeus Mar 19 '17 at 21:42
  • Thank you, @tripleee. I thought something was wrong with the encoding. regex worked as expected. – moth Mar 19 '17 at 22:22
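
If installing the third-party regex module is not an option, the split problem discussed in these comments can probably be worked around with the standard re module by matching words instead of splitting on `\W+`, and allowing combining marks inside a word. The sketch below assumes that the Combining Diacritical Marks block (U+0300 to U+036F) covers the marks you care about; split_words is a hypothetical name.

```python
import re

# Assumption: only the Combining Diacritical Marks block (U+0300-U+036F)
# is treated as part of a word; other combining ranges are not handled.
WORD = re.compile(r'[\w\u0300-\u036f]+')

# Hypothetical helper: returns the words in text, keeping combining marks
# attached to the letters they follow.
def split_words(text):
    return WORD.findall(text)

text = '\u0130\u015fbirli\u011fi \u00f6nemlidir'.lower()  # 'İşbirliği önemlidir'
words = split_words(text)
print(len(words))   # the first word keeps its 'i' + U+0307 pair intact
print(words)
```

Using findall() with a "word" pattern sidesteps the question of what `\W` should match, at the cost of hard-coding the range of combining characters.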