3

I have a requirement wherein my C++ code needs to do case-insensitive comparison without worrying about whether the string is encoded or not, or the type of encoding involved. The string could be ASCII or non-ASCII; I just need to store it as is and compare it with a second string without worrying about whether the right locale is set and so forth.

Use case: Suppose my application receives a string (let's say it's a file name), initially "Zoë Saldaña.txt", and stores it as is. Subsequently, it receives another string, "zoë saLdañA.txt", and a comparison between this and the first string should result in a match, ideally through a small set of APIs. The same goes for the file names "abc.txt" and "AbC.txt".

I read about IBM's ICU and how it uses UTF-16 encoding by default. I'm curious to know:

  1. Does ICU provide a means of meeting my requirement by seamlessly handling strings regardless of their encoding type?

  2. If the answer to 1. is no, then, using ICU's APIs, is it safe to convert all strings (both ASCII and non-ASCII) to UTF-16 and then do the case-insensitive comparison and other operations?

  3. Are there alternatives that facilitate this?

I read this post, but it doesn't quite meet my requirements.

Thanks!

Maddy
  • You can't do case-insensitive compare without knowing the locale. In Turkey, "FILE" should *not* match "file". ("FİLE" should match "file", and "FILE" should match "fıle"). In case it's not obvious, Turkish has a dotted i (i and İ) and a dotless i (ı and I). – Martin Bonner supports Monica Mar 29 '16 at 11:36
  • How are your strings encoded? You can't do anything useful unless you at least know what the source encoding is. – 一二三 Mar 29 '16 at 11:38
  • The use case really is a bit silly. For file names, you can't arbitrarily decide that they're case insensitive. Files on most Unix-like file systems _are_ case sensitive, whether you like it or not. For Windows/NTFS, they are case-insensitive _using the case table stored on that disk_! – MSalters Mar 29 '16 at 11:58

4 Answers

7

The requirement is impossible. Computers don't work with characters; they work with numbers. But "case insensitive" comparison is an operation on characters. Locales determine which numbers correspond to which characters, and are therefore indispensable.

The above holds in every programming language, and it's even true for case-sensitive comparisons. The mapping from character to number isn't always unique, which means that comparing two numbers isn't enough: there could be a locale where character 42 is equivalent to character 43. In Unicode it's even worse. There are number sequences of different lengths that are still equivalent (precomposed and decomposed characters in particular).

MSalters
  • Unicode is horrible, a complete and utter nightmare. The really scary thing is how much better it is than anything that went before. – Martin Bonner supports Monica Mar 29 '16 at 11:49
  • @MartinBonner: Well, with about 6000 languages (ignoring dialects), having a single encoding is a challenge. That said, you could in theory introduce a "Unicode Light" without the outright silly stuff like emojis, but at what cost? – MSalters Mar 29 '16 at 11:55
  • Emojis are no problem at all (they are just "characters"). It's stuff like "LATIN CAPITAL LETTER A WITH RING ABOVE" versus "LATIN CAPITAL LETTER A"+"COMBINING RING ABOVE" (not forgetting "ANGSTROM SIGN") which I think is a nightmare. (And how "LATIN CAPITAL LETTER A"+"COMBINING RING ABOVE"+"COMBINING CEDILLA" should compare equal to "LATIN CAPITAL LETTER A"+"COMBINING CEDILLA"+"COMBINING RING ABOVE".) The *real* difficulty though is that real languages are complicated: my favourite example is that the lower-case version of "MASS" in German is "Maß" or "Mass" depending on which word it is. – Martin Bonner supports Monica Mar 29 '16 at 12:11
  • And yes, I did leave the "M" capitalized. – Martin Bonner supports Monica Mar 29 '16 at 12:11
  • I think that's _sentence case_, not lower case. _Lower case_ would be "mass" or "maß". It's just that German nouns are not supposed to be written in lower case. (IOW `tolower` is the wrong choice of algorithm, it's not buggy itself) – MSalters Mar 29 '16 at 13:07
  • @MSalters: Thanks for the response! Suppose the file name received is encoded with UTF-16. Using gconv (iconv) or ICU, would I be able to convert from UTF-16 to local character encoding and do the case-insensitive comparisons? Say, convert the encoding to ASCII and do the comparison? – Maddy Apr 01 '16 at 05:48
  • Or does the said approach bring down the performance of the application doing the case-insensitive comparisons? Is there a better way? – Maddy Apr 01 '16 at 06:00
  • @Maddy: Why convert? That only makes things a **lot** harder. Convert to ASCII, and literally 99.9% of Unicode characters become "?". I'm not making up a number there, Unicode literally has a thousand times more characters. Use ICU for a comparison of the unmodified strings. – MSalters Apr 01 '16 at 07:35
  • @MSalters: Are you suggesting that a Unicode string be stored as is, and use it to compare with another Unicode string via ICU, provided they both have a similar encoding type? – Maddy Apr 01 '16 at 08:22
  • Also, if the encoding type is something other than UTF-16, would it be wrong to convert it to UTF-16 format for comparison operations? Which other encoding format is safe to be converted into without losing out on anything? – Maddy Apr 01 '16 at 08:39
  • @Maddy: Load both strings to compare into ICU, tell ICU what encoding to use when reading in both strings, and let ICU compare the two _without worrying about ICU's internal encoding_. Any Unicode encoding is fine, whether it's UTF-8, UTF-16 or UTF-32. ICU understands them all. – MSalters Apr 01 '16 at 08:43
  • And I suppose the encoding that we tell the ICU to use ought to be the same as that of the strings? – Maddy Apr 01 '16 at 08:58
  • Re: "The requirement is impossible." – there is the DUCET in Unicode's TR10 ( https://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table ); using it would be exactly what is needed for a locale-agnostic, case-insensitive "default" collation, which would not be universal, but it would be stable. – Mike Kaganski Jun 14 '21 at 10:09
  • @MikeKaganski Agreed, the Unicode standard supports the concept of locale-neutral case folding. It doesn't work for every application, but it sounds about right for the OP. I provided the gory details below in my answer. By now, it _may_ even be supported by ICU; last time I looked, though, ICU required the use of Locale everywhere. – Charlie Reitzel Sep 24 '21 at 18:11
3

Without knowing the encoding, you cannot do that. I will take one example using French accented characters and two different encodings: cp850, used as the OEM character set for Windows in the West European zone, and the well-known iso-8859-1 (also known as latin1, and not very different from the win1252 ANSI character set for Windows).

  • in cp850, 0x96 is 'û', 0xca is '╩', 0xea is 'Û'
  • in latin1, 0x96 is non printable(*), 0xca is 'Ê', 0xea is 'ê'

So if the string is cp850-encoded, 0xea should compare equal to 0x96, and 0xca is a different character.

But if the string is latin1-encoded, 0xea should compare equal to 0xca, 0x96 being a control character.

You could find similar examples with other iso-8859-x encodings, but I only speak of languages I know.

(*) in cp1252, 0x96 is '–' (Unicode character U+2013), not related to 'ê'

Serge Ballesta
3

For UTF-8 (or other Unicode) encodings, it is possible to perform a "locale neutral" case-insensitive string comparison. This type of comparison is useful in multi-locale applications, such as network protocols (e.g. CIFS), international database data, etc.

The operation is possible due to Unicode metadata which clearly identifies which characters may be "folded" to/from which upper/lower case characters.

As of 2007, when I last looked, there were fewer than 2000 upper/lower case character pairs. It was also possible to generate a perfect hash function to convert upper to lower case (and most likely vice versa as well, but I didn't try it).

At the time, I used Bob Burtle's perfect hash generator. It worked great in a CIFS implementation I was working on at the time.

There aren't many smallish, fixed sets of data out there you can point a perfect hash generator at. But this is one of 'em. :--)

Note: this is locale-neutral, so it will not support applications like German telephone books. There are a great many applications where you should definitely use locale-aware folding and collation. But there are also a large number where locale-neutral is actually preferable, especially now that folks are sharing data across so many time zones and, necessarily, cultures. The Unicode standard does a good job of defining a good set of shared rules.

If you're not using Unicode, the presumption is that you have a really good reason. As a practical matter, if you have to deal with other character encodings, you have a highly locale aware application. In which case, the OP's question doesn't apply.

Charlie Reitzel
  • How do I tell the APIs that do Unicode-aware case-folded comparisons (e.g., `_wcscmp` on Windows) to do such a "locale neutral" comparison? Or does every developer need to implement the algorithm given in section *Caseless Matching* at https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G21790 ? – Francis Litterio Jan 02 '23 at 16:17
  • @FrancisLitterio It's impossible to say in the abstract. You have to read the documentation and/or source code for each specific implementation. For your example, see the [Microsoft documentation](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/stricmp-wcsicmp-mbsicmp-stricmp-l-wcsicmp-l-mbsicmp-l). The C runtime library (along with many other APIs) uses the notion of code pages. On most systems, there are code pages for unicode (UTF-8, UTF-16). Windows uses UTF-16LE internally for strings, btw. – Charlie Reitzel Jan 03 '23 at 18:01
0

Well, first I must say that any programmer dealing with natural-language text has the utmost duty to know and understand Unicode well. Other ancient 20th-century encodings still exist, but things like EBCDIC and ASCII are not able to encode even simple English text, which may contain words like façade, naïve or fiancée, or a geographical sign, a mathematical symbol or even emojis (conceptually, these are similar to ideograms). And the majority of the world's population does not use Latin characters to write text.

UTF-8 is now the prevalent encoding on the Internet, and UTF-16 is used internally by all present-day operating systems, including Windows, which unfortunately still does it wrong. (For example, NTFS has a decade-long reported bug that allows a directory to contain two files with names that look exactly the same but are encoded with different normal forms. I get this a lot when synchronising files via FTP between Windows and macOS or Linux: all my files with accented characters get duplicated, because unlike the other systems, Windows uses a different normal form and only normalises the file names at the GUI level, not at the file-system level. I reported this in 2001 for Windows 7, and the bug is still present today in Windows 10.)

If you still don't know what a normal form is, start here: https://en.wikipedia.org/wiki/Unicode_equivalence

Unicode has strict rules for lower- and uppercase conversion, and these should be followed to the letter in order for things to work nicely. First, make sure both strings use the same normal form (you should do this during input processing; the Unicode standard specifies the algorithm). Please do not reinvent the wheel: use ICU's normalising and comparison facilities. They have been extensively tested and they work correctly, and IBM has made them available gratis.

A note: if you plan on comparing strings for ordering, please remember that collation is locale-dependent and highly influenced by the language and the scenario. For example, in a dictionary these Portuguese words would appear in this exact order: sabia, sabiá, sábia, sábio. The same ordering rules would not work for an address list, which would use phonetic rules to place names like Peçanha and Pessanha adjacently. The same phenomenon happens in German with ß and ss. Yes, natural language is not logical, or, better said, its rules are not simple.

C'est la vie. これが私たちの世界です。 (That's life. This is our world.)

Jaccoud