7

I am looking for a method to compare and sort UTF-8 strings in C++ in a case-insensitive manner to use it in a custom collation function in SQLite.

  1. The method should ideally be locale-independent. I won't be holding my breath, though: as far as I know, collation is very language-dependent. So anything that works on languages other than English will do, even if it means switching locales.
  2. Options include using standard C or C++ library or a small (suitable for embedded system) and non-GPL (suitable for a proprietary system) third-party library.

What I have so far:

  1. strcoll with C locales and std::collate/std::collate_byname are case-sensitive. (Are there case-insensitive versions of these?)
  2. I tried to use POSIX strcasecmp, but its behavior seems to be unspecified for locales other than "POSIX":

    In the POSIX locale, strcasecmp() and strncasecmp() do upper to lower conversions, then a byte comparison. The results are unspecified in other locales.

    And, indeed, the result of strcasecmp does not change between locales on Linux with GLIBC.

    #include <clocale>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <strings.h>  // POSIX header that declares strcasecmp
    
    static const char *s1 = "Äaa";
    static const char *s2 = "äaa";
    
    // Print both comparisons under whatever locale is currently active.
    static void compare() {
        printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
        printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));
    }
    
    // Switch locale outside of assert() so the call is not compiled out by -DNDEBUG.
    static void use_locale(const char *name) {
        if (!setlocale(LC_ALL, name)) {
            fprintf(stderr, "locale %s is not available\n", name);
            exit(EXIT_FAILURE);
        }
    }
    
    int main() {
        compare();  // default "C" locale
        use_locale("en_AU.UTF-8");
        compare();
        use_locale("fi_FI.UTF-8");
        compare();
    }
    

    This is printed:

    strcasecmp('Äaa', 'äaa') == -32
    strcoll('Äaa', 'äaa') == -32
    strcasecmp('Äaa', 'äaa') == -32
    strcoll('Äaa', 'äaa') == 7
    strcasecmp('Äaa', 'äaa') == -32
    strcoll('Äaa', 'äaa') == 7
    

P. S.

And yes, I am aware of ICU, but we can't use it on the embedded platform due to its enormous size.

chills42
Alex B

6 Answers

7

What you really want is logically impossible. There is no locale-independent, case-insensitive way of sorting strings. The simple counter-example: is "i" equal to "I" when case is ignored? The naive answer is yes, but in Turkish these strings are unequal. There, "i" is uppercased to "İ" (U+0130, Latin Capital Letter I with Dot Above), and "I" is lowercased to "ı" (U+0131, Latin Small Letter Dotless I).

UTF-8 strings add extra complexity to the question. They're perfectly valid multi-byte char* strings, if you have an appropriate locale. But neither the C nor the C++ standard defines such a locale; check with your vendor (too many embedded vendors, sorry, no general answer here). So you HAVE to pick a locale whose multi-byte encoding is UTF-8 for the mbscmp function to work. This of course influences the sort order, which is locale-dependent. And if you have NO locale in which const char* is UTF-8, you can't use this trick at all. (As I understand it, Microsoft's CRT suffers from this. Their multi-byte code only handles characters up to 2 bytes; UTF-8 needs up to 4.)

wchar_t is not the standard solution either. It is supposedly wide enough that you don't have to deal with multi-byte encodings, but your collation will still depend on the locale (LC_COLLATE). However, using wchar_t means you can now choose locales that do not use UTF-8 for const char*.

With this done, you can basically write your own ordering by converting strings to lowercase and comparing them. It's not perfect. Do you expect L"ß" == L"ss"? They're not even the same length. Yet for German text you have to consider them equal. Can you live with that?
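
The "lowercase and compare" idea above could be sketched roughly as follows. This is only an illustration, not a complete solution: it assumes the strings have already been decoded into code points (std::u32string is used here instead of wchar_t, because wchar_t is only 16 bits on some platforms), and the helper names simple_lower and compare_nocase are made up for this example. The mapping covers only ASCII and the Latin-1 Supplement and deliberately ignores the Turkish "i" and the "ß" versus "ss" cases mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: a simple, locale-independent lowercase mapping.
// Only ASCII and the Latin-1 Supplement are handled; every other code
// point is returned unchanged.
char32_t simple_lower(char32_t c) {
    if (c >= U'A' && c <= U'Z')
        return static_cast<char32_t>(c + 32);   // A-Z -> a-z
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)    // U+00C0..U+00DE, skipping the multiplication sign U+00D7
        return static_cast<char32_t>(c + 32);   // e.g. U+00C4 -> U+00E4
    return c;
}

// Hypothetical comparator: returns <0, 0 or >0 like strcmp, comparing the
// lowercased code points one by one. The resulting order is by code point
// value, which is not a linguistically correct order for any language.
int compare_nocase(const std::u32string& a, const std::u32string& b) {
    std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        char32_t x = simple_lower(a[i]);
        char32_t y = simple_lower(b[i]);
        if (x != y) return x < y ? -1 : 1;
    }
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;        // the shorter string sorts first
}
```

With this sketch, compare_nocase(U"\u00C4aa", U"\u00E4aa") returns 0, while "ß" and "ss" still compare as different, which is exactly the limitation described above.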

MSalters
  • 2
    About your example with the German "ß" character (and all such abundant cases): these must have been "solved" or otherwise dealt with thousands of times before, UTF-8 or no. MS Word has always had a "toggle case" feature - how did it work on that character in pre-Unicode versions? How did WordPerfect? I am having the same problem as the OP, except I work in Delphi. I've seen a number of Windows sqlite-based apps that perform a case-insensitive SELECT (and I guess ORDER BY), whether they are installed in an English, German or (in my case) Polish locale. Try Firefox :) How do they do that? – Marek Jedliński Oct 17 '09 at 23:19
  • Usually incorrect :) Polish has IIRC no hard cases; all non-ASCII characters used in Polish are "based on" ASCII characters. – MSalters Oct 19 '09 at 08:19
  • Except for the Turkish I problem, the Unicode Case Folding algorithm (http://www.unicode.org/reports/tr44/) works remarkably well. – dalle Jun 09 '13 at 21:11
  • UTF-8 can have up to 4 bytes per codepoint, and a grapheme can be multiple (the exact max is unspecified IIRC) codepoints. – MarcusJ Dec 17 '17 at 10:59
0

I don't think there's a standard C/C++ library function you can use. You'll have to roll your own or use a 3rd-party library. The full Unicode specification for locale-specific collation can be found here: http://www.unicode.org/reports/tr10/ (warning: this is a long document).

Adam Rosenfield
0

On Windows you can fall back on the OS function CompareStringW and use the NORM_IGNORECASE flag. You'll have to convert your UTF-8 strings to UTF-16 first. Otherwise, take a look at IBM's International Components for Unicode.

Harold Ekstrom
0

I believe you will need to roll your own or use a third-party library. I recommend a third-party library because there are a lot of rules that need to be followed to get true international support; it is best to let an expert deal with them.

Ray
0

I have no definitive answer in the form of example code, but I should point out that a UTF-8 byte stream contains, in fact, Unicode characters, and you have to use the wchar_t versions of the C/C++ runtime library.

You have to convert those UTF-8 bytes into wchar_t strings first, though. This is not very hard, as the UTF-8 encoding standard is very well documented. I know this, because I've done it, but I can't share that code with you.
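
Since no code is given above, here is a minimal sketch of what such a conversion might look like. It is not the code this answer refers to. It decodes into std::u32string rather than wchar_t (an assumption made here so that code points above U+FFFF fit even where wchar_t is 16 bits), and the function name utf8_decode is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into code points.
// Returns false on malformed input: invalid lead byte, missing or bad
// continuation byte, overlong form, surrogate, or value above U+10FFFF.
bool utf8_decode(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;       // code point being assembled
        std::size_t extra; // number of continuation bytes expected
        char32_t lowest;   // smallest value allowed for this sequence length
        if (b < 0x80)                { cp = b;        extra = 0; lowest = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; lowest = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; lowest = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; lowest = 0x10000; }
        else return false;           // stray continuation byte or invalid lead byte

        if (in.size() - i - 1 < extra) return false;    // sequence is truncated
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;       // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < lowest) return false;                  // overlong encoding
        if (cp > 0x10FFFF) return false;                // outside the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogates are not valid in UTF-8
        out.push_back(cp);
        i += extra + 1;
    }
    return true;
}
```

Decoding only yields code points. Case folding and ordering still have to be applied to the result afterwards.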

Dave Van den Eynde
0

If you are using it for searching and sorting in your own locale only, I suggest that your function call a simple replace function that converts both multi-byte strings into one-byte-per-character strings, using a table like:

A -> a
à -> a
á -> a
ß -> ss
Ç -> c
and so on

Then simply call strcmp and return the results.
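
A rough sketch of this table approach, under some assumptions: the input is valid UTF-8, the table below lists only a handful of illustrative entries (a real one would have to cover every character your locale needs), and the names Fold, kFoldTable, fold_to_ascii and compare_folded are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical table entry: a UTF-8 sequence and its ASCII replacement.
struct Fold { const char* from; const char* to; };

// Illustrative entries only; extend for the characters you actually need.
static const Fold kFoldTable[] = {
    {"\xC3\x80", "a"},   // À
    {"\xC3\xA0", "a"},   // à
    {"\xC3\x81", "a"},   // Á
    {"\xC3\xA1", "a"},   // á
    {"\xC3\x84", "a"},   // Ä
    {"\xC3\xA4", "a"},   // ä
    {"\xC3\x87", "c"},   // Ç
    {"\xC3\xA7", "c"},   // ç
    {"\xC3\x9F", "ss"},  // ß
};

// Replace every table match with its ASCII form and lowercase ASCII letters.
std::string fold_to_ascii(const std::string& s) {
    std::string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) {
            // Lowercase A-Z by hand so the result does not depend on the C locale.
            out.push_back(static_cast<char>(b >= 'A' && b <= 'Z' ? b + 32 : b));
            ++i;
            continue;
        }
        bool matched = false;
        for (const Fold& f : kFoldTable) {
            std::size_t len = std::strlen(f.from);
            if (s.compare(i, len, f.from) == 0) {   // table sequence starts at position i
                out += f.to;
                i += len;
                matched = true;
                break;
            }
        }
        if (!matched) out.push_back(s[i++]);        // not in the table: copy the byte unchanged
    }
    return out;
}

// Fold both strings, then compare the results with strcmp as suggested above.
int compare_folded(const std::string& a, const std::string& b) {
    return std::strcmp(fold_to_ascii(a).c_str(), fold_to_ascii(b).c_str());
}
```

For the SQLite use case in the question, a function like compare_folded could be called from the comparison callback registered with sqlite3_create_collation. That callback receives pointers and byte lengths rather than NUL-terminated strings, so each argument would first have to be copied into a std::string of the given length.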