Getting the correct Collator setting in ICU

Question

The requirement is to be able to do case insensitive operations on both ASCII and Unicode strings. Each input string is encoded using UTF-16LE and stored as a std::basic_string<u_int16_t> data type. The majority of suggestions pointed at ICU, so I took a stab at it.

I wrote some sample code to try out a few inputs:

#include <iostream>
#include "unicode/coll.h"

using namespace icu;
using namespace std;

int main()
{
    UErrorCode success = U_ZERO_ERROR;
    Collator *collator = Collator::createInstance("UTF-16LE", success);
    collator->setStrength(Collator::PRIMARY);

    if (collator->compare("dinç", "DINÇ") == 0) {
        cout << "Strings are equal" << endl;
    } else {
        cout << "Strings are unequal" << endl;
    }
    return 0;
}

The strings in question contain Turkish characters. From what I read, the comparison should fail, since 'i' and 'I' are distinct letters in Turkish rather than upper and lower case forms of the same letter. But they are deemed equal.

A couple questions:

Should the strings be UTF-16 encoded prior to feeding them to ICU? Would that solve the problem?
In general, which collator settings are ideal for case-insensitive operations on UTF-16 encoded strings? I read that setting the strength to PRIMARY or SECONDARY results in a case-insensitive comparison. In addition to this, is there anything else that I might be missing?

Thanks!

score 1 · Accepted Answer · answered Apr 12 '16 at 10:30

In addition to this, is there anything else that I might be missing?

YES! Your code is missing the Turkish.

The Unicode casing rules are kinda simple, until you get Turkish in there†. Turkish Is are messy. The uppercase form of i is İ, not I, and the lowercase form of I is ı, not i; and the pair i/İ denotes a different letter from the pair ı/I.

This means that there are two different sets of rules for case-insensitive comparison: one where i is equal to I (most locales), and one where it is different (for Turkish and Azerbaijani locales).
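To make the two rule sets concrete, here is a toy sketch in plain C++. It is not ICU and not a full Unicode case folding: the helpers `foldUnit` and `equalFolded` are hypothetical names, and they only handle ASCII A-Z plus the four "I" letters (i, I, ı, İ). The `turkic` flag switches between the default rules and the Turkish/Azerbaijani ones.

```cpp
#include <cstddef>
#include <string>

// Toy fold for a single UTF-16 unit (hypothetical helper, not an ICU API).
// Only ASCII A-Z and the dotted/dotless I letters are handled.
char16_t foldUnit(char16_t c, bool turkic) {
    if (turkic) {
        if (c == u'I') return u'\u0131';   // I -> dotless i (U+0131)
        if (c == u'\u0130') return u'i';   // dotted capital I (U+0130) -> i
    }
    if (c >= u'A' && c <= u'Z') {
        // Default rules: plain ASCII lowercase, so I -> i here.
        return static_cast<char16_t>(c - u'A' + u'a');
    }
    return c;
}

// Compares two strings unit by unit after folding each unit.
bool equalFolded(const std::u16string& a, const std::u16string& b, bool turkic) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (foldUnit(a[i], turkic) != foldUnit(b[i], turkic)) return false;
    }
    return true;
}
```

Under the default rules `equalFolded(u"din", u"DIN", false)` is true, while under the Turkic rules the same call with `true` is false, because I folds to ı rather than i.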

In order to get the Turkish locale semantics with ICU you need to create a collator with a specific locale, in this case the tr_TR locale.

† not only Turkish. There are four languages with weird casing rules; from least messy to hellish: Turkish and Azeri, Lithuanian, Greek.

answered Apr 12 '16 at 10:30

R. Martinho Fernandes


Thanks! Barring the four languages you mentioned, would a collator with UTF-16 as locale be able to handle case insensitive operations for most other languages? – Maddy Apr 13 '16 at 03:52
Also this [post](http://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16) says that UTF-16 size is variable and can go up to 4 bytes. If so, the storage needed for the strings needs to change to `std::basic_string` or the equivalent, correct? – Maddy Apr 13 '16 at 04:11
@Maddy it would be able to do equality comparisons, but ordering (aka collation) varies a lot from language to language, so no. Also, no, don't change your strings. UTF-16 uses 16-bit units. – R. Martinho Fernandes Apr 13 '16 at 10:20
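The point about 16-bit units can be checked with the standard library alone. The sketch below uses `std::u16string` as a stand-in for the asker's string type, and the helper `countCodePoints` is a hypothetical name; it assumes well-formed UTF-16. A character outside the Basic Multilingual Plane occupies two 16-bit units (a surrogate pair), so the "up to 4 bytes" case still fits in a string of 16-bit units.

```cpp
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-16 by skipping the second (low)
// half of each surrogate pair. Hypothetical helper for illustration.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        const bool lowSurrogate = unit >= 0xDC00 && unit <= 0xDFFF;
        if (!lowSurrogate) ++n;
    }
    return n;
}
```

For example, `std::u16string(u"din\u00E7")` has 4 units and 4 code points, while `std::u16string(u"\U0001F600")` has 2 units but only 1 code point.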
