The requirement is to perform case-insensitive operations on both ASCII and Unicode strings. Each input string is encoded as UTF-16LE and stored as a std::basic_string<u_int16_t>. Most of the suggestions I found pointed at ICU, so I took a stab at it.
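For context, this is roughly how the input ends up as 16-bit code units. It is only a sketch under assumptions: decode_utf16le is a name made up for illustration, the raw bytes are assumed to arrive in a std::vector, and it uses std::u16string rather than std::basic_string<u_int16_t> so that it compiles without a custom char_traits.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of ICU or of the original code):
// assemble raw UTF-16LE bytes into 16-bit code units.
std::u16string decode_utf16le(const std::vector<std::uint8_t>& bytes) {
    std::u16string out;
    out.reserve(bytes.size() / 2);
    // Walk the buffer two bytes at a time; a dangling odd byte is ignored.
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        // Little-endian: the low byte comes first, then the high byte.
        out.push_back(static_cast<char16_t>(bytes[i] | (bytes[i + 1] << 8)));
    }
    return out;
}
```

With this, the bytes 64 00 69 00 6E 00 E7 00 become the four code units of "dinç". No surrogate validation is attempted here.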
I wrote some sample code to try out a few inputs:
#include <iostream>
#include "unicode/coll.h"

using namespace icu;
using namespace std;

int main()
{
    UErrorCode success = U_ZERO_ERROR;
    Collator *collator = Collator::createInstance("UTF-16LE", success);
    collator->setStrength(Collator::PRIMARY);
    if (collator->compare("dinç", "DINÇ") == 0) {
        cout << "Strings are equal" << endl;
    } else {
        cout << "Strings are unequal" << endl;
    }
    delete collator;
    return 0;
}
The strings in question contain Turkish characters. From what I read, the comparison should fail, since 'i' and 'I' are different characters in the Turkish character set, regardless of whether they are both upper or lower case. But they are deemed equal.
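To make the 'i' versus 'I' point concrete, here is a small sketch of my understanding of the case mappings. The function names lower_default and lower_turkish are made up for illustration, and the code only handles A-Z plus the Turkish dotted and dotless i letters; it is not a substitute for ICU's case mapping and does not touch characters such as 'Ç'.

```cpp
// Hypothetical helper: locale-independent lowercase for A-Z only.
// Any other code unit is returned unchanged.
char16_t lower_default(char16_t c) {
    return (c >= u'A' && c <= u'Z')
               ? static_cast<char16_t>(c + (u'a' - u'A'))
               : c;
}

// Hypothetical helper: the same mapping, except for the Turkish i letters.
// In Turkish, 'I' (U+0049) pairs with dotless 'ı' (U+0131),
// and dotted 'İ' (U+0130) pairs with 'i' (U+0069).
char16_t lower_turkish(char16_t c) {
    switch (c) {
        case u'I':      return u'\u0131';  // I -> ı (dotless)
        case u'\u0130': return u'i';       // İ -> i (dotted)
        default:        return lower_default(c);
    }
}
```

Under the default mapping 'I' lowercases to 'i', so the two compare equal once case is ignored; under the Turkish mapping 'I' lowercases to 'ı', which is why I expected "dinç" and "DINÇ" to differ.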
A couple of questions:
Should the strings be UTF-16 encoded prior to feeding them to ICU? Would that solve the problem?
In general, which collator settings are ideal for case-insensitive operations on UTF-16 encoded strings? I read that setting the strength to PRIMARY or SECONDARY results in a case-insensitive comparison. Is there anything else that I might be missing?
Thanks!