C++11 case insensitive comparison of beginning of a string (unicode)

Question

I have to check whether a particular string begins with another one. The strings are encoded in UTF-8, and the comparison should be case insensitive.

I know that this is very similar to the topic "Case insensitive string comparison in C++", but I do not want to use the Boost library and I prefer portable solutions (if that is 'nearly' impossible, I prefer Linux-oriented solutions).

Is it possible in C++11 using its regexp library? Or just using simple string compare methods?

Why don't you want to use boost? (It's practically standard on all development machines nowadays.) — Martin York, May 04 '12 at 07:30
Try a portable unicode compliant string library such as ICU. Though, I really don't see why you can use one portable solution and not another. — dirkgently, May 04 '12 at 07:33
It might seem simple but there are far more issues than you may think. First, there are many different possible representations for visual characters: for instance, the character `é` has its own code point, but can also be achieved by using the character `e` followed by the acute accent code point. Your solution needs to be aware of that. Second, case-insensitive comparison usually takes the strings and uppercases/lowercases them. This is actually a locale-sensitive operation: for instance, the German letter `ß` is the shorthand for `ss` and its uppercase version is `SS`. — zneak, May 04 '12 at 07:34
In other words, you certainly don't want to roll your own library for Unicode string manipulation, and since C++ doesn't have built-in features for that, you'll have to choose your poison. — zneak, May 04 '12 at 07:35
@Loki Astari Because my supervisor strongly discourages me from doing that, and I prefer not to argue with her. — Dejwi, May 04 '12 at 07:37
OK. Learning to do it manually for educational purposes is a good reason. But once you get to the real world stl/boost are indispensable. — Martin York, May 04 '12 at 07:41
Ah! You must then start with Unicode's website. There used to be some source code to get you started too. — dirkgently, May 04 '12 at 07:44
I think he just means that his supervisor explicitly recommended not using boost. I know a lot of old hands that have an irrational fear of boost stemming from its early days, the (even now) poor documentation, the lengthy and complicated build procedures on platforms without ready-made distributions, and the sheer size of the thing. — Mahmoud Al-Qudsi, May 04 '12 at 07:45
@Mahmoud : Yeah, `./bootstrap.sh && b2` is really complicated. ;-] — ildjarn, May 04 '12 at 16:09
@ildjarn I last built boost on Windows many years ago, but I seem to recall it was a lot more involved than that. — Mahmoud Al-Qudsi, May 04 '12 at 16:56
@Mahmoud : It hasn't been any more involved than that for at least two years, unless you need to cross-compile. :-] — ildjarn, May 04 '12 at 16:59

score 13 · Answer 1 · answered May 04 '12 at 07:38

The only way I know of that is UTF8/internationalization/culture-aware is the excellent and well-maintained IBM ICU: International Components for Unicode. It's a C/C++ library for *nix or Windows into which a ton of research has gone to provide a culture-aware string library, including case-insensitive string comparison that's both fast and accurate.

IMHO, the two things you should never write yourself unless you're doing a thesis paper are encryption and culture-sensitive string libraries.

I am not sure those are the only two, but I completely agree that it's not something one is likely to get right! — Matthieu M., May 04 '12 at 08:16

score 3 · Answer 2 · answered May 04 '12 at 08:02

Are there any restrictions on what can be in the string you're looking for? If it's user input, and can be any UTF-8 string, the problem is extremely complex. As others have mentioned, one character can have several different representations, so you'd probably have to normalize the strings first. Then: what counts as equal? Should 'E' compare equal to 'é' (as is usual in some circles in French), or not (which would conform to the "official" rules of the Imprimerie nationale)?
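
To make the point about multiple representations concrete, here is a minimal sketch using only the standard library. The names `kPrecomposed`, `kDecomposed` and `same_bytes` are made up for this illustration. It shows that the two usual encodings of "é" are different byte sequences, so a plain `std::string` comparison treats them as different strings unless both are normalized first.

```cpp
#include <string>

// U+00E9 (precomposed "é") encoded as UTF-8: two bytes.
const std::string kPrecomposed = "\xC3\xA9";

// U+0065 ("e") followed by U+0301 (combining acute accent): three bytes.
const std::string kDecomposed = "e\xCC\x81";

// Byte-wise equality, which is all that std::string::operator== provides.
// It knows nothing about Unicode normalization forms.
bool same_bytes(const std::string& a, const std::string& b) {
    return a == b;
}
```

Both strings render as the same character, yet `same_bytes(kPrecomposed, kDecomposed)` is false. A normalization step (NFC or NFD, for example through ICU) is what makes them comparable.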

For all but the most trivial definitions, rolling your own will represent a significant effort. For this sort of thing, the library ICU is the reference. It contains all that you'll need. Note however that it works on UTF16, not UTF8, so you'll have to convert the strings first, as well as normalizing them. (ICU has support for both.)

answered May 04 '12 at 08:02 by James Kanze

It is quite unfortunate that it settled on UTF-16. I wish for a version of the library that would deal with UTF-8 directly instead :x – Matthieu M. May 04 '12 at 08:18
I have to filter a list of names and surnames, and they can contain characters from any Latin-based alphabet. I thought that every national character (like é) has its own uppercase variant and should compare equal only to it. – Dejwi May 04 '12 at 08:23
@MiniKarol The equivalence between characters is very locale dependent. In French, it's common (although IMHO not good practice) to omit accents on upper case, so `'E'` would be the (ambiguous) upper case for `'e'`, `'é'`, `'è'`, `'ë'` and `'ê'`. In Swiss German, `"Ae"` is the standard upper case for `'ä'`. (Note that the upper case requires two code points, where the lower case may be only a single code point.) – James Kanze May 04 '12 at 08:34
@MiniKarol Not to mention the German `'ß'`, whose upper case form depends on the word (at least according to Duden). You can ignore accents entirely by converting to the Normalized form D and ignoring the various combining accents in the text; this is a simple (but not too accurate) solution, but will still not work in cases where the number of code points in upper case and lower case are different. – James Kanze May 04 '12 at 08:37

score 2 · Answer 3 · AquilaRapax · 2012-05-04T08:50:32.543

Using the STL regex classes you could do something like the following snippet. Unfortunately it's not UTF-8 aware. Changing str2 to std::wstring str2 = L"hello World" results in a lot of conversion warnings. Making str1 a std::wstring doesn't work at all, since std::regex doesn't accept wchar_t input as far as I can see (std::wregex is the wchar_t counterpart).

#include <regex>
#include <iostream>
#include <string>

int main()
{
    //The input strings
    std::string str1 = "Hello";
    std::string str2 = "hello World";

    //Define the regular expression using case-insensitivity
    //(str1 is treated as a pattern, so regex metacharacters in it would need escaping)
    std::regex regx(str1, std::regex_constants::icase);

    //Only search at the beginning 
    std::regex_constants::match_flag_type fl = std::regex_constants::match_continuous;

    //display some output
    std::cout << std::boolalpha << std::regex_search(str2.begin(), str2.end(), regx, fl) << std::endl;

    return 0;
}
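
For the "simple string compare methods" part of the question, a rough sketch without regex could look like the following. It assumes that case-insensitivity is only needed for ASCII letters and that both strings are well-formed UTF-8. The helper names `ascii_lower` and `starts_with_ascii_icase` are hypothetical. Non-ASCII characters must match byte for byte, so this is not a Unicode-aware comparison.

```cpp
#include <cstddef>
#include <string>

// Lowercase only ASCII 'A'..'Z'. Every other byte is returned unchanged,
// including all bytes of multi-byte UTF-8 sequences (which are >= 0x80).
inline unsigned char ascii_lower(unsigned char c) {
    return (c >= 'A' && c <= 'Z')
               ? static_cast<unsigned char>(c - 'A' + 'a')
               : c;
}

// Hypothetical helper: true if `text` begins with `prefix`, ignoring case
// for ASCII letters only. Non-ASCII characters are compared byte for byte.
bool starts_with_ascii_icase(const std::string& text,
                             const std::string& prefix) {
    if (prefix.size() > text.size()) return false;
    for (std::size_t i = 0; i < prefix.size(); ++i) {
        if (ascii_lower(static_cast<unsigned char>(text[i])) !=
            ascii_lower(static_cast<unsigned char>(prefix[i]))) {
            return false;
        }
    }
    return true;
}
```

With this, `starts_with_ascii_icase("hello World", "Hello")` is true, but "É" and "é" are still treated as different characters, which is exactly the limitation that a library such as ICU removes.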

answered May 04 '12 at 07:57, edited May 04 '12 at 08:50, by AquilaRapax

That's right. But in my answer I said that std::regex doesn't work with wchar, so I hoped it would be a valid answer anyway, since it answers the first question with "no". – AquilaRapax May 04 '12 at 08:38
The thing is, **utf-8** uses a regular `std::string`, so `char` under the hood. – Matthieu M. May 04 '12 at 11:02
So let's say we have a UTF-8 implementation based on chars, which means one UTF-8 character could be represented as a string of 1 to 4 bytes. Wouldn't UTF-8 words look like a "normal" string and thus be handled by the regex class? OK, case-insensitivity wouldn't work, but theoretically... – AquilaRapax May 04 '12 at 11:41
It can work with the regex class for a number of regular expressions. For example `(.*?)` will happily capture everything in the `name` tag, utf-8 or not. However it will fail as soon as you start using shortcuts. For example `\w` is essentially `[a-zA-Z0-9_]`, so it's alphanumeric... for ASCII, and will not match letters outside the basic Latin alphabet, accented letters, etc... Also, since it does not know about multibyte encoding, even `([^<]{1,26})` may not work as desired: it will capture from 1 to 26 **bytes**, not codepoints or characters. – Matthieu M. May 04 '12 at 12:30
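
The difference between bytes and code points can be sketched as follows. The function name `utf8_code_points` is made up for this illustration, and the input is assumed to be well-formed UTF-8. It counts code points by skipping continuation bytes, which is the kind of awareness a byte-oriented regex quantifier lacks.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is not a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is performed.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For `"\xC3\xA9"` (precomposed "é") `size()` is 2 while `utf8_code_points` returns 1. For `"e\xCC\x81"` (decomposed "é") `size()` is 3 and the function returns 2, even though a reader sees a single character, so counting code points is still not the same as counting characters.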
