Why does "i" get replaced with "ı"

Question

I received a crash report from an application which was trying to read XML from a file it had previously written. After requesting the user send me the file, I compared it with what should have been written and found a really odd problem I haven't come across before.

Some (but not all) of the i characters had been replaced with ı - a dotless i. For example, a node named "title" was fine, but a node named "initialdirectory" had the first i replaced while the others were left alone, i.e. ınitialdirectory.

Until today I wasn't even aware there was such a character. Now that I am, I still don't know how it came to be written like that - the XML was written using an XmlWriter with UTF8 encoding. Just a normal everyday write, nothing complicated.

I normally (well, since getting Resharper, which yells at me for skipping the parameter) use StringComparison.OrdinalIgnoreCase when doing IndexOf etc., but I'm at a loss as to how I'm supposed to do this when writing data, unless I'm supposed to start changing thread cultures.

Has anyone experienced a similar issue before, and if so, what's the best way to deal with it?

This is the "Turkish i" issue - search for it. See, for example, http://stackoverflow.com/questions/444798/case-insensitive-containsstring/15464440#15464440 or, better yet, http://stackoverflow.com/questions/3550213/in-c-sharp-what-is-the-difference-between-toupper-and-toupperinvariant/3550226#3550226 — Alexei Levenkov, Apr 04 '13 at 21:44
Perhaps you have a `typo` where you do the replace? I mean, the `i` and the `l` are right by each other on the keyboard — MethodMan, Apr 04 '13 at 21:45

Joni · Accepted Answer · 2013-04-04T22:12:05.947

5

In Turkish there are two i's: one with a dot, i, and one without a dot, ı. In upper case the first one has a dot, İ, and the second one hasn't, I.

At some point your program is converting InitialDirectory to lower case according to the default locale, which is Turkish on some users' machines. Under Turkish casing rules the lower case of I is ı, not i, which is why only the capitalised first letter was affected. To fix the problem you can convert cases using a fixed, known locale, such as American English.

Update: Even better, use the ToLowerInvariant() method which converts a string to lower case in the "invariant culture".
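As a rough illustration of the difference, here is a minimal sketch. The class name `TurkishIDemo` and the helper `LowerWith` are made up for this example and are not from the question's code. It assumes the runtime has full culture data available; on a runtime set up for invariant globalization the `tr-TR` line may not show the dotless ı.

```csharp
using System;
using System.Globalization;

public static class TurkishIDemo
{
    // Hypothetical helper: lower-cases using an explicitly named culture,
    // so the result does not depend on the thread's current culture.
    public static string LowerWith(string value, string cultureName)
    {
        return value.ToLower(CultureInfo.GetCultureInfo(cultureName));
    }

    public static void Main()
    {
        const string name = "InitialDirectory";

        // Turkish rules: 'I' is expected to lower-case to the dotless 'ı',
        // which matches the "ınitialdirectory" seen in the user's file.
        Console.WriteLine(LowerWith(name, "tr-TR"));

        // A fixed, known culture gives the plain ASCII result.
        Console.WriteLine(LowerWith(name, "en-US"));

        // Culture-independent: the same on every machine.
        Console.WriteLine(name.ToLowerInvariant()); // initialdirectory
    }
}
```

Plain `name.ToLower()` with no argument uses the current culture, so it behaves like the first call on a machine set to Turkish and like the second on one set to US English.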

edited Apr 04 '13 at 22:12

answered Apr 04 '13 at 21:47

Joni


2

I think it's better to use invariant culture instead of `en-US`. To make this easier, there's even a shortcut for that: [`ToLowerInvariant()`](http://msdn.microsoft.com/en-us/library/system.string.tolowerinvariant.aspx). – svick Apr 04 '13 at 21:52
@Joni - thanks for the answer. You are correct that I am indeed doing `ToLower` on the strings in this class. I do actually use `ToLowerInvariant` - but generally only on data that the user directly enters or modifies. Anything fixed I still use `ToLower` on... perhaps I should rethink that one! Sounds like you have hit the nail on the head though, I shall change them and see if the user still has the problem. Thanks again! – Richard Moss Apr 04 '13 at 22:13
1

There's a good article about this issue here: http://www.i18nguy.com/unicode/turkish-i18n.html – Matthew Watson Apr 04 '13 at 22:58
