
Turkish characters 'ÇçĞğİıÖöŞşÜü' are not handled correctly in UTF-8 encoding, although they all seem to be defined. In use, the character code of every one of them comes back as 65533 (the replacement character, presumably used for error display), and a question mark or a box is shown depending on the selected font. In some cases 0/null is returned as the character code. On the internet there are lots of tools that give the UTF-8 definitions of these characters, but I am not sure whether those tools use a defined (real/international) registry or dynamically create the definitions from known rules and calculations. The fonts for them are well defined and there is no problem displaying them when we enter the code points manually. This proves that they are defined in UTF-8. But on the other hand they are not handled correctly in encodings or transformations such as AJAX requests/responses.

So the base question is "HOW CAN WE DEFINE A CODE POINT FOR A CHARACTER"? To prevent misconceptions, the question may be tailored as follows. Suppose we have prepared the encoding data for "Ç" like this:

Character          : Ç
Character name     : LATIN CAPITAL LETTER C WITH CEDILLA
Hex code point     : 00C7
Decimal code point : 199
Hex UTF-8 bytes    : C387
......

Where/how can we save this info so that it becomes a standard UTF-8 character? How can we distribute/expose it (make it ready to be used by others)? Do we need confirmation from anybody/any foundation (like the Unicode/UTF-8 consortium)? How can we detect/fix errors if the characters are already registered but not working correctly? Can we have a custom UTF-8 configuration? If yes, how?

Note: No code snippet is needed here, as this is not a misuse problem.

Cœur
İlhan ÇELİK
  • Every Unicode character, including the Turkish alphabet, can be expressed in UTF-8 encoding. If you are seeing replacement characters, your text is either encoded incorrectly or you are using the wrong encoding to read it. The conversion between Unicode code points and UTF-8 byte strings is well-defined and fixed (see http://en.wikipedia.org/wiki/UTF-8). You cannot customize it. – Brent Ramerth Feb 04 '13 at 07:40
  • I had read http://en.wikipedia.org/wiki/UTF-8 and much more before asking my question. When I don't give any charset, the page loads fine but AJAX requests fail (except in Firefox and Opera). When I give iso-8859-9/windows-1254, the page loads fine and AJAX works, but only with Firefox. When I give utf-8, the page can't show the special characters but AJAX works with all (6 major) browsers. These problems do not occur with other languages. This shows that something in the definitions is wrong. By myself I have no problem doing conversions, but I would like Turkish characters to work as well as those of other languages. – İlhan ÇELİK Feb 05 '13 at 04:14
  • @İlhanÇELİK The page of this question (stackoverflow) is encoded in UTF-8. Do you see characters used in Turkish correctly or are they broken here too? – Joni Feb 05 '13 at 08:23
  • Joni, as I wrote before, there are no display problems. Problems arise while transforming or encoding/decoding them. Manual (or functional) conversions work fine. But why should we have to do extra work at every step of execution when working with Turkish characters? I work around each problem in a suitable/required way, but those are personal efforts. I am going to reply to this question with a detailed example to demonstrate a single point in a series of similar problems. – İlhan ÇELİK Feb 05 '13 at 13:52
  • I am not allowed to answer my own question. So I am going to copy/paste the snippets as short comments. – İlhan ÇELİK Feb 05 '13 at 13:57
  • With the charset declared as utf-8, ÇçĞğİıÖöŞşÜü and ABCDEF is displayed as ������������ and ABCDEF; when we leave out the charset declaration we get -> ÇçĞğİıÖöŞşÜü and ABCDEF. – İlhan ÇELİK Feb 05 '13 at 13:59
  • When we declare iso-8859-9 or windows-1254 we get ÇçĞğİıÖöŞşÜü and ABCDEF again. So you may advise using one of them, but then another problem arises: XMLHttpRequest and its equivalents use only utf-8. There is no way to declare a charset unless you use the XML portion, and I want to use the request/response portion directly. (The next comment covers the AJAX send/read.) – İlhan ÇELİK Feb 05 '13 at 14:19
  • TRANSFER: When the charset is utf-8 there is no AJAX problem. When the charset is other than utf-8 and we send ÇçĞğİıÖöŞşÜü and ABCDEF, Firefox and Opera have no problems, but all other browsers display ������������ and ABCDEF. It's not a display problem; the server side doesn't get the correct character codes. – İlhan ÇELİK Feb 05 '13 at 14:31
  • So, when we send ÇçĞğİıÖöŞşÜü and ABCDEF from the client with a non-utf-8 charset, the browsers can't encode the text into the correct utf-8 form. But even if the charset is not utf-8, when we send manually inserted utf-8 encoded text, browsers can display it. I solve this problem by encoding the text into the correct utf-8 values with JavaScript, or simply by base64-encoding the text before sending. But then there is a lot of work on both the client and server sides. This is NOT A REAL SOLUTION, just a work-around. – İlhan ÇELİK Feb 05 '13 at 14:44
  • CONCLUSION: I am not asking how to encode/decode or transform/transfer etc. I believe that Turkish characters are badly defined in utf-8, so I am asking for the way to re-define or newly define them. How can we make them be handled correctly with any default utf-8 settings? – İlhan ÇELİK Feb 05 '13 at 14:54
  • WARNING: The source file encoding for the examples above is ISO-8859-9. – İlhan ÇELİK Feb 05 '13 at 15:09
  • EITHER the source file has to be encoded in utf-8 OR you have to change the declared encoding. You can't lie to the browser about the document's encoding and expect it to render correctly. XHR is another story: most utilities just assume utf-8, so you must encode the source in utf-8. This applies to all languages of the world; Turkish is no different. – Joni Feb 06 '13 at 14:00
  • I mentioned the source-code charset because when the source charset is changed the problematic characters display as other symbols. I see no point in discussing them again. By the way, Joni, you wrote about locales somewhere. Locales are local, as the name says. I am not interested in a single-language site addressing a single country. I am working on an interactive, international site which transfers and displays content from different countries and languages within the same page. The best method to overcome most of the problems is to use hidden iframes as targets and POST requests (form element) and – İlhan ÇELİK Feb 07 '13 at 01:05
  • get the response from the iframe's document at iframe.onload(). This way the charset declarations can be handled, but keeping track of every action becomes very complicated as we add new actions and variables. The simplest way is XHR-based communication for requests, and at that point the charset conflicts are the problem. I tested the same things in Turkish, Arabic and Chinese (not all languages). I made the server simply echo what I send; in Chinese and Arabic I got back the same text, but in Turkish I always had problems. The problems differ slightly between browsers. Firefox has the fewest problems; it worked in most cases. – İlhan ÇELİK Feb 07 '13 at 01:17
  • Please explain what exactly your problem is. I assure you the Turkish language, and the characters that are used when writing it, are _not_ treated differently from any other language in the world. Every one of us has to deal with these problems. I mentioned locales because you mentioned alphabetic order and case conversions, and those depend on the language and country you are in. – Joni Feb 07 '13 at 07:34
  • I got tired of discussing unrelated subjects mixed into my problem. Look at this -> 90:Z, 122:z, 65:A, 97:a, 199:Ç, 231:ç, 208:Ğ, 240:ğ, 221:İ, 253:ı, 214:Ö, 246:ö, 222:Ş, 254:ş, 220:Ü, 252:ü. All characters should be in the same order as they are in the alphabet. z CANNOT be smaller than any other character/character code in Turkish. If 'İ' is 221 then 'I' must be 220 and 'i' must be 253 (+ 0x20), 'ı' must be 252, 'ğ' must be 250 and 'Ğ' must be 218, where 'G' must be 217. If 'İ':221 is incorrect, then the correct value should be assigned and all the others ought to be fixed to match. – İlhan ÇELİK Feb 07 '13 at 14:15
  • The Turkish alphabet (uppercase) is ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ, and anybody must be able to convert it to lowercase like this (lc = LC & 0x20;) or be able to test if ((ch & 0x20) == ch) then ch is lowercase... Similarly, if (ch <= Turkish(z) && ch >= Turkish(A)) then ch is a Turkish comparable/printable letter... As the current Unicode definitions for Turkish don't match these criteria, algorithmic conversions/encodings do not work correctly. To make them work we need lots of extra code, which leads to unpredictable results in case of any carelessness, and it is obviously a loss of time and effort. ..... – İlhan ÇELİK Feb 07 '13 at 14:34

1 Answer


The characters you mention are present in Unicode. Here are their code points in hexadecimal and how they are encoded in UTF-8:

      Ç     ç     Ğ     ğ     İ     ı     Ö     ö     Ş     ş     Ü     ü
Code: 00c7  00e7  011e  011f  0130  0131  00d6  00f6  015e  015f  00dc  00fc
UTF8: c3 87 c3 a7 c4 9e c4 9f c4 b0 c4 b1 c3 96 c3 b6 c5 9e c5 9f c3 9c c3 bc

This means that if you write for example the bytes 0xc4 0x9e into a file you have written the character Ğ, and any software tool that understands UTF-8 must read it back as Ğ.
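
For instance, here is a minimal Java sketch of that round trip (the class name is just for illustration): it decodes the two bytes 0xc4 0x9e as UTF-8 and encodes the code point U+011E back into the same two bytes.

import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        // The two UTF-8 bytes for Ğ from the table above.
        byte[] bytes = { (byte) 0xc4, (byte) 0x9e };

        // Any UTF-8 aware decoder turns them back into the single character Ğ.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println((int) decoded.charAt(0));               // 286 = 0x011e, i.e. Ğ

        // Encoding the code point again yields the same two bytes.
        byte[] encoded = "\u011e".getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02x %02x%n", encoded[0], encoded[1]);  // c4 9e
    }
}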

Update: For correct alphabetic order and case conversions in Turkish you have to use a library that understands locales, just like for any other natural language. For example, in Java:

Locale tr = new Locale("tr", "TR");                  // Turkish locale: language "tr", country "TR"
System.out.println("ÇçĞğİıÖöŞşÜü".toUpperCase(tr));  // ÇÇĞĞİIÖÖŞŞÜÜ
System.out.println("ÇçĞğİıÖöŞşÜü".toLowerCase(tr));  // ççğğiıööşşüü

Notice how i in uppercase becomes İ, and I in lowercase becomes ı. You don't say which programming language you use but surely its standard library supports locales, too.

Unicode defines the code points and certain properties for each character (for example, whether it is a digit or a letter, and for a letter whether it is uppercase, lowercase, or titlecase), and certain generic algorithms for dealing with Unicode text (e.g. how to mix right-to-left and left-to-right text). Alphabetic order and correct case conversion are defined by national standardization bodies, like the Institute for the Languages of Finland in Finland or the Real Academia Española in Spain, independently of Unicode.
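
As a rough illustration (a sketch in Java, with an illustrative class name), the standard library exposes those per-character Unicode properties directly, here for İ (U+0130) and ı (U+0131):

public class UnicodeProperties {
    public static void main(String[] args) {
        for (char ch : new char[] { '\u0130', '\u0131' }) {   // İ and ı
            System.out.printf("U+%04X letter=%b upper=%b lower=%b%n",
                    (int) ch,
                    Character.isLetter(ch),      // both are letters
                    Character.isUpperCase(ch),   // true only for İ
                    Character.isLowerCase(ch));  // true only for ı
        }
    }
}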

Update 2:

The test ((ch & 0x20) == ch) for lower case is broken for most languages in the world, not just Turkish, and so is the algorithm you mention for converting upper case to lower case. The test for being a letter is also incorrect: in many languages Z is not the last letter of the alphabet. To work with text correctly you must use library functions that have been written by people who know what they are doing.
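
A short sketch of the difference (Java again; the class name is illustrative), using ğ (U+011F) from the table above: the bit tricks from the comments give the wrong answer, while the Unicode-aware library calls give the right one.

public class CaseTests {
    public static void main(String[] args) {
        char ch = '\u011f';                             // ğ

        // The bit test from the comments says ğ is not lowercase (0x011f & 0x20 == 0).
        System.out.println((ch & 0x20) == ch);          // false
        // The Unicode-aware test knows that it is.
        System.out.println(Character.isLowerCase(ch));  // true

        // "Between A and z" says ğ is not a letter at all; isLetter knows better.
        System.out.println(ch >= 'A' && ch <= 'z');     // false
        System.out.println(Character.isLetter(ch));     // true
    }
}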

Unicode is supposed to be universal. Creating national and language-specific variants of encodings is what led us to the mess that Unicode is trying to solve. Unfortunately there is no universal standard for ordering characters. For example, in English a = ä < z, but in Swedish a < z < ä. In German Ü is equivalent to U by one standard and to UE by another. In Finnish Ü = Y. There is no way to order code points so that the ordering would be correct in every language.
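
For example, a small sketch with java.text.Collator (the class name CollationDemo is just for illustration) showing the same two strings ordering differently under English and Swedish rules:

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationDemo {
    public static void main(String[] args) {
        String[] english = { "z", "\u00e4" };   // "z" and "ä"
        String[] swedish = english.clone();

        Arrays.sort(english, Collator.getInstance(Locale.ENGLISH));
        Arrays.sort(swedish, Collator.getInstance(new Locale("sv", "SE")));

        System.out.println(Arrays.toString(english));   // [ä, z]  - ä sorts with a
        System.out.println(Arrays.toString(swedish));   // [z, ä]  - ä sorts after z
    }
}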

Joni
  • Joni, thank you very much for answering, but yours isn't the required answer. There are many lexical and logical mistakes with those Turkish special characters. The upper/lower case mappings are wrong and the alphabetical order is incorrect. They should be corrected or redefined, and I am asking about the way and the procedures for UTF-8 definitions. Who will correct them? How will they be corrected? There is no such problem with Arabic, Chinese or Persian. They should be more problematic than Turkish, but it's just the opposite. That means Turkish people weren't involved in any development stage. Those mistakes can't go on for ever... – İlhan ÇELİK Feb 05 '13 at 03:50
  • See my answer at http://stackoverflow.com/questions/14560531/os-x-10-6-8-cannot-input-non-ascii-utf-8-chars-e-g-a-a-o-in-python-intera/14763641#14763641. In utf-8, every language must have a full range to define its full alphabet, while keeping Unicode untouched so the same characters display the same way. ((ch & 0x20) == ch) and the other tests are given to mean that utf-8 MUST MATCH the REQUIREMENTS, but it doesn't. I gave Arabic and Chinese as examples in many places. They work fine because they are not ASCII-based alphabets and they have their own full range. All ASCII-based, non-English alphabets share the same range. – İlhan ÇELİK Feb 08 '13 at 13:08
  • Ah, now I understand: your complaint is that not every language is assigned a Unicode block of its own? That is a complication, but Turkish is not the only language affected by it. Consider all the Latin-based Central and Eastern European languages. That doesn't mean that UTF-8 is broken. – Joni Feb 08 '13 at 14:54
  • It could also be a problem with charset conversions and Windows encodings? ISO 8859-9 is the only charset that supports all Turkish characters; ISO 8859-1 and ISO 8859-15 (and Windows-1252, which is based on them) only fully support Western European languages. – 0x4a6f4672 Oct 11 '13 at 14:19
  • It's not UTF-8 that's broken, it's Unicode. For example, using the ASCII values for 'I' and 'i' in Turkish breaks upper/lower case conversion and sorting no matter the encoding. Sure, one can change the locale (if the code supports it), but that doesn't help with mixed-language text. – WGroleau Oct 28 '22 at 23:28