Compare UTF-8 characters

Question

Here is a parsing function:

double transform_units(double val, const char* s)
{
    cout << s[0];
    if (s[0] == 'm') return val * 1e-3;
    else if (s[0] == 'µ') return val * 1e-6;
    else if (s[0] == 'n') return val * 1e-9;
    else if (s[0] == 'p') return val * 1e-12;
    else return val;
}

In the line with 'µ' I'm getting the warning:

warning: multi-character character constant [-Wmultichar]

and the character 'µ' is not being caught.

How to compare multibyte characters?

Edit: a dirty workaround is to check whether the byte is less than zero. As Giacomo mentioned, the character is 0xCE 0xBC; both bytes are greater than 127, so they are negative when char is signed. It works for me.

Compare them as strings, not as individual characters. `µ` doesn't fit in a single `char`, as the warning says. In UTF-8, it is 2 chars `0xC2 0xB5` — Remy Lebeau, Sep 06 '21 at 09:43
or `0xCE 0xBC` (Greek mu letter), so you may need to normalize — Giacomo Catenazzi, Sep 06 '21 at 09:54
C++ has the [char8_t](https://en.cppreference.com/w/cpp/language/types#char8_t) type for UTF-8 characters and strings. You can specify UTF-8 characters and literals with the `u8` prefix, e.g. `u8'μ'`, `u8'm'`. If you use the correct type you'll be able to compare characters directly. [UTF-8 string literals](https://en.cppreference.com/w/cpp/language/string_literal) are available since C++11, so even if your compiler doesn't support `char8_t`, you'll be able to use `u8'μ'` — Panagiotis Kanavos, Sep 06 '21 at 09:56
@PanagiotisKanavos "you'll be able to use u8'μ'" That's [not true](https://godbolt.org/z/9KrjWaYh9). — n. m. could be an AI, Sep 06 '21 at 10:28
There is no such thing as a UTF-8 character. UTF-8 is a variable-length encoding of Unicode characters. Each Unicode character is encoded as a sequence of bytes. In order to compare byte sequences, you compare byte sequences, not individual bytes. A byte sequence in C++ is normally represented as a character array. — n. m. could be an AI, Sep 06 '21 at 10:31
The edit mentions a sign test: don't. That is not portable, as `char` may be unsigned in a different implementation (see e.g. https://stackoverflow.com/questions/75191/what-is-an-unsigned-char) — Hans Olsson, Sep 06 '21 at 11:01

eerorika · Accepted Answer · 2021-09-06T14:02:44.363

2

How to compare multibyte characters?

You can compare a Unicode code point that is encoded as multiple bytes (more generally, multiple code units) by comparing all of those bytes. s[0] is only a single char, which is one byte in size, and thus cannot by itself hold a multi-byte sequence.

This may work: std::strncmp(s, "µ", std::strlen("µ")) == 0.
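As a rough sketch of how that comparison could slot into the function from the question: the helper name starts_with below is made up for illustration, and the "µ" literal is assumed to end up as UTF-8 bytes, which depends on how the source file is saved and compiled.

```cpp
#include <cstring>

// Hypothetical helper: true if the NUL-terminated string s begins with prefix.
// strncmp stops at the first mismatch or at the NUL in s, so it does not
// read past the end of s.
static bool starts_with(const char* s, const char* prefix)
{
    return std::strncmp(s, prefix, std::strlen(prefix)) == 0;
}

double transform_units(double val, const char* s)
{
    if (starts_with(s, "m")) return val * 1e-3;
    if (starts_with(s, "µ")) return val * 1e-6;  // multi-byte literal, compared as a string
    if (starts_with(s, "n")) return val * 1e-9;
    if (starts_with(s, "p")) return val * 1e-12;
    return val;
}
```

Because every prefix is a string literal rather than a character literal, the -Wmultichar warning should no longer apply, and the single-byte prefixes go through the same code path as the multi-byte one.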

edited Sep 06 '21 at 14:02

answered Sep 06 '21 at 10:46


Note that this depends on how you save the source file (i.e., which encoding you use). – wovano Sep 06 '21 at 12:20
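One way to sidestep the source-encoding dependency, sketched here under the assumption that the input is UTF-8: spell the bytes out with hex escapes instead of typing the character. The function name micro_prefix_length is hypothetical. It also accepts both spellings mentioned in the comments, the micro sign (0xC2 0xB5) and the Greek letter mu (0xCE 0xBC).

```cpp
#include <cstring>

// Byte length of a leading micro prefix in s, or 0 if there is none.
// Both spellings are written as explicit UTF-8 bytes, so the result does
// not depend on the encoding of this source file.
std::size_t micro_prefix_length(const char* s)
{
    static const char* const spellings[] = {
        "\xC2\xB5",  // U+00B5 MICRO SIGN
        "\xCE\xBC",  // U+03BC GREEK SMALL LETTER MU
    };
    for (const char* spelling : spellings) {
        const std::size_t n = std::strlen(spelling);
        if (std::strncmp(s, spelling, n) == 0) return n;
    }
    return 0;
}
```

A caller could treat a non-zero result as "scale by 1e-6" and skip that many bytes to reach the rest of the unit string. Full Unicode normalization would need a proper library; this only covers the two code points named in the thread.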
