String search with Turkish dotless i

Question

When searching the text Çınaraltı Café for the text Ci using the code

NSStringCompareOptions options =
    NSCaseInsensitiveSearch |
    NSDiacriticInsensitiveSearch |
    NSWidthInsensitiveSearch;
NSLocale *locale = [NSLocale localeWithLocaleIdentifier:@"tr"];
NSRange range = [haystack rangeOfString:needle 
                                options:options
                                  range:NSMakeRange(0, haystack.length)
                                 locale:locale];

I get range.location equals NSNotFound.

It's not to do with the diacritic on the initial Ç because I get the same result searching for alti where the only odd character is the ı. I also get a valid match searching for Cafe which contains a diacritic (the é).

The Apple docs mention this situation in the notes on the locale parameter, and I think I'm following them. Though I guess I'm not, because it's not working.

How can I get a search for 'i' to match both 'i' and 'ı'?

I don't think it's worth the effort of searching through Apple documentation, I would just use a regex at your place. — Ramy Al Zuhouri, Jul 08 '13 at 23:19
The docs you mention cover a different situation than the one you have here. If you have a string with the uppercase dotless i and you do a case-insensitive search for it with a regular i, then it will work fine unless you use the Turkish locale. With the Turkish locale, the uppercase dotless i can only be found with a lowercase dotless i, not a regular i. I did a few tests and, regardless of locale, there doesn't seem to be any way to match the dotless i with a regular i. Perhaps it is a bug. – rmaddy, Jul 08 '13 at 23:54
@rmaddy I just assumed that if a case insensitive search for `I` matches both `i` and `ı` then _surely_ a case insensitive search for `i` matches both `i` and `ı`. Perhaps I just need to know more about the Turkish language :( — deanWombourne, Jul 09 '13 at 09:59
Please note that the dotted i (i and İ) is a proper, bona fide letter in the Turkish alphabet. The dot it has is not a diacritic. The dot is not modifying a dotless i (ı and I), which is itself also a proper letter. What you are seeing (I matching i and ı) may be a bug on Apple's part, since it does not seem to be commutative. If you have any further questions w/ Turkish, I'll be happy to help. — Sabuncu, Jul 09 '13 at 17:02
Also: +1 for the needle and haystack variables. Makes it very clear! — Sabuncu, Jul 09 '13 at 17:06
Did you really use `needle.length` in your input range? At least when searching for "alti", that would be wrong. In general, you want to use `haystack.length` to search over the entirety of `haystack`. — Ken Thomases, Jul 25 '14 at 01:26

Tim · Answer 1 · 2013-07-25T00:40:39.437

I don't know whether this helps as an answer, but it perhaps explains why it's happening.

I should point out I'm not an expert in this matter, but I've been looking into this for my own purposes and been doing some research.

Looking at the Unicode collation chart for Latin, the characters equivalent to ASCII "i" (\u0069) do not include "ı" (\u0131), whereas all the other letters in your example string behave as you expect, i.e.:

"c" (\u0063) does include "Ç" (\u00c7)
"e" (\u0065) does include "é" (\u00e9)

The ı character is listed separately as being of primary difference to i. That might not make sense to a Turkish speaker (I'm not one) but it's what Unicode have to say about it, and it does fit the logic of the problem you describe.
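
One way to see the difference from code is to look at canonical decomposition rather than the collation chart. This is only a sketch to illustrate the point, and it assumes the string literals are stored precomposed (NFC) in the source file: Ç and é each split into a base letter plus a combining mark, which is what a diacritic-insensitive comparison can strip, while ı has no decomposition at all.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // Ç (U+00C7) decomposes to C + combining cedilla (U+0327).
        // é (U+00E9) decomposes to e + combining acute (U+0301).
        // ı (U+0131) has no canonical decomposition, so it stays one unit.
        for (NSString *s in @[@"Ç", @"é", @"ı"]) {
            NSString *decomposed = s.decomposedStringWithCanonicalMapping;
            // Expected lengths for precomposed input: 2, 2, 1.
            NSLog(@"%@ -> %lu UTF-16 unit(s) after decomposition",
                  s, (unsigned long)decomposed.length);
        }
    }
    return 0;
}
```

So there is no combining mark on ı to ignore, which fits the behaviour described in the question.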

In Chrome you can see this in action with an in-page search. Searching in the page for ASCII i highlights all the characters in its block and does not match ı. Searching for ı does the opposite.

By contrast, MySQL's utf8_general_ci collation table maps uppercase ASCII I to ı as you want.

So, without knowing anything about iOS, I'm assuming it's using the Unicode standard and normalising all characters to Latin by this table.

As to how you match Çınaraltı with Ci - if you can't override the collation table then perhaps you can just replace i in your search strings with a regular expression, so you search on Ç[iı] instead.
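
A rough sketch of that idea in Objective-C, using `rangeOfString:options:` with `NSRegularExpressionSearch`. The pattern here is written by hand for the needle "Ci" and is my own illustration, not something from the question; it spells out every variant explicitly instead of relying on case or diacritic folding, and it assumes the text is precomposed (which the code forces with `precomposedStringWithCanonicalMapping`).

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // Normalise to NFC so that Ç and é are single characters.
        NSString *haystack = [@"Çınaraltı Café" precomposedStringWithCanonicalMapping];

        // Hand-written pattern for the needle "Ci": each letter becomes a
        // character class listing the variants it should accept.
        NSString *pattern = @"[cCçÇ][iıIİ]";

        NSRange range = [haystack rangeOfString:pattern
                                        options:NSRegularExpressionSearch];
        if (range.location != NSNotFound) {
            NSLog(@"Matched \"%@\" at %lu",
                  [haystack substringWithRange:range],
                  (unsigned long)range.location);
        } else {
            NSLog(@"No match");
        }
    }
    return 0;
}
```

The obvious cost is that you have to generate the classes for every letter you care about, and any regex metacharacters in user input would need escaping first.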

Further to this, I've been [playing with transliteration in JavaScript](http://apps.timwhitlock.info/js/translit) — Tim, Jul 28 '13 at 18:09

score 3 · Answer 2 · answered Feb 05 '17 at 09:28

I wrote a simple extension in Swift 3 for Turkish string search.

let turkishSentence = "Türkçe ya da Türk dili, batıda Balkanlar’dan başlayıp doğuda Hazar Denizi sahasına kadar konuşulan Altay dillerinden biridir."
let turkishWannabe = "basLayip"

let shouldBeTrue = turkishSentence.contains(turkishString: turkishWannabe, caseSensitive: false)
let shouldBeFalse = turkishSentence.contains(turkishString: turkishWannabe, caseSensitive: true)

You can check it out from https://github.com/alpkeser/swift_turkish_string_search/blob/master/TurkishTextSearch.playground/Contents.swift

akaralar · Accepted Answer · 2014-07-30T16:53:58.287

I did this and it seems to work well for me. Hope it helps!

NSString *cleanedHaystack = [haystack stringByReplacingOccurrencesOfString:@"ı"
                                                                withString:@"i"];
cleanedHaystack = [cleanedHaystack stringByReplacingOccurrencesOfString:@"İ"
                                                             withString:@"I"];

NSString *cleanedNeedle = [needle stringByReplacingOccurrencesOfString:@"ı"
                                                            withString:@"i"];
cleanedNeedle = [cleanedNeedle stringByReplacingOccurrencesOfString:@"İ"
                                                         withString:@"I"];

NSUInteger options = (NSDiacriticInsensitiveSearch |
                      NSCaseInsensitiveSearch |
                      NSWidthInsensitiveSearch);
NSRange range = [cleanedHaystack rangeOfString:cleanedNeedle
                                       options:options];
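
If you also need the match location in the original string (for highlighting, say), a variation like the following may help. It is a sketch under a few assumptions: `FoldTurkishI` is a hypothetical helper name, not a Foundation API, and it uses `NSLiteralSearch` for the replacements so that each ı or İ is swapped one-for-one with a single UTF-16 unit. Because the folded string then has the same length as the original, the range found in it can be applied to the original directly.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): replaces dotless ı with i
// and dotted İ with I, one UTF-16 unit for one UTF-16 unit.
static NSString *FoldTurkishI(NSString *s) {
    s = [s stringByReplacingOccurrencesOfString:@"ı"
                                     withString:@"i"
                                        options:NSLiteralSearch
                                          range:NSMakeRange(0, s.length)];
    return [s stringByReplacingOccurrencesOfString:@"İ"
                                        withString:@"I"
                                           options:NSLiteralSearch
                                             range:NSMakeRange(0, s.length)];
}

int main(void) {
    @autoreleasepool {
        NSString *haystack = @"Çınaraltı Café";
        NSString *needle = @"alti";
        NSStringCompareOptions options = NSCaseInsensitiveSearch |
                                         NSDiacriticInsensitiveSearch |
                                         NSWidthInsensitiveSearch;

        NSRange range = [FoldTurkishI(haystack) rangeOfString:FoldTurkishI(needle)
                                                      options:options];
        if (range.location != NSNotFound) {
            // The folded and original strings have identical lengths, so the
            // same range picks out the matching text in the original.
            NSLog(@"Found \"%@\"", [haystack substringWithRange:range]);
        } else {
            NSLog(@"Not found");
        }
    }
    return 0;
}
```

With the example text this should log the original substring "altı" rather than the folded "alti".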

edited Jul 30 '14 at 16:53

answered Jul 25 '14 at 01:12


Yep, that works _in this exact case_ - unfortunately, I don't control the input data (it's entered by editors in Turkey) so there will be other texts that don't match correctly. This just happened to be the first that I spotted! I'm hoping there's a general solution to my problem. I suspect that this is unsolvable because I don't really understand Turkish - they might just be different letters; it might be like expecting an en-GB pattern match for 'a' to match 'b' :| – deanWombourne Aug 08 '14 at 12:34
@deanWombourne My native language is Turkish and I can confirm this is the only edge case, the diacritic insensitive search covers every case except this one. I am using this in my projects and haven't had your problem yet, so i hope it helps! :) – akaralar Aug 08 '14 at 15:28
yes that's incredibly helpful, thank you! Looks like I can get away with just string replacing :) – deanWombourne Aug 29 '14 at 09:17

score 1 · Answer 4 · answered Sep 23 '14 at 10:45

As Tim mentions, we can use a regular expression to match text containing i or ı. I also didn't want to add a new field or change the source data, as the search looks through a huge number of strings. So I ended up with a solution using regular expressions and NSPredicate.

Create an NSString category and copy this method into it. It returns a basic matching pattern in which every dotted or dotless i accepts either form. You can use it with any method that accepts a regular expression pattern.

- (NSString *)zst_regexForTurkishLettersWithCaseSensitive:(BOOL)caseSensitive
{
    NSMutableString *filterWordRegex = [NSMutableString string];
    for (NSUInteger i = 0; i < self.length; i++) {
        NSString *letter = [self substringWithRange:NSMakeRange(i, 1)];
        if (caseSensitive) {
            if ([letter isEqualToString:@"ı"] || [letter isEqualToString:@"i"]) {
                letter = @"[ıi]";
            } else if ([letter isEqualToString:@"I"] || [letter isEqualToString:@"İ"]) {
                letter = @"[Iİ]";
            }
        } else {
            if ([letter isEqualToString:@"ı"] || [letter isEqualToString:@"i"] ||
                [letter isEqualToString:@"I"] || [letter isEqualToString:@"İ"]) {
                letter = @"[ıiIİ]";
            }
        }
        [filterWordRegex appendString:letter];
    }
    return filterWordRegex;
}

So if the search word is Şırnak, it creates Ş[ıi]rnak for case sensitive and Ş[ıiIİ]rnak for case insensitive search.

And here are the possible usages.

NSString *testString = @"Şırnak";

// First create your search regular expression.
NSString *searchWord = @"şır";
NSString *searchPattern = [searchWord zst_regexForTurkishLettersWithCaseSensitive:NO];

// Then create your matching pattern.
NSString *pattern = searchPattern; // Direct match
// NSString *pattern = [NSString stringWithFormat:@".*%@.*", searchPattern]; // Contains
// NSString *pattern = [NSString stringWithFormat:@"\\b%@.*", searchPattern]; // Begins with

// NSPredicate
// c for case insensitive, d for diacritic insensitive
NSPredicate *predicate = [NSPredicate predicateWithFormat:@"self matches[cd] %@", pattern]; 
if ([predicate evaluateWithObject:testString]) {
    // Matches
}

// If you want to filter an array of objects
NSArray *matchedCities = [allAirports filteredArrayUsingPredicate:
    [NSPredicate predicateWithFormat:@"city matches[cd] %@", pattern]];

You can also use NSRegularExpression, but I think using a case- and diacritic-insensitive search with NSPredicate is much simpler.
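
For comparison, here is roughly what the NSRegularExpression route could look like. It is a sketch, not a drop-in replacement: the pattern is hard-coded to what the category method above produces for @"şır" with caseSensitive:NO, and since NSRegularExpression has a case-insensitive option but no diacritic-insensitive one, letters such as ş or ç in the search word would only match themselves (in either case).

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        NSString *testString = @"Şırnak";

        // Pattern the category method yields for @"şır", case insensitive.
        NSString *pattern = @"ş[ıiIİ]r";

        NSError *error = nil;
        NSRegularExpression *regex =
            [NSRegularExpression regularExpressionWithPattern:pattern
                                                      options:NSRegularExpressionCaseInsensitive
                                                        error:&error];
        if (regex == nil) {
            NSLog(@"Invalid pattern: %@", error);
            return 1;
        }

        // Unlike "matches" in a predicate, this finds the pattern anywhere
        // in the string, so no leading or trailing .* is needed.
        NSRange match = [regex rangeOfFirstMatchInString:testString
                                                 options:0
                                                   range:NSMakeRange(0, testString.length)];
        NSLog(@"%@", match.location == NSNotFound ? @"No match" : @"Match");
    }
    return 0;
}
```

You get the match range for free this way, which the predicate does not give you, at the price of handling other diacritics yourself.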
