How to correctly skip unicode (UTF-8) characters?

Question

I have written a parser that turns out to work incorrectly with UTF-8 text.

The parser is very very simple:

while (pos < end) {

    // find some ASCII char
    if (text.at(pos) == '@') {
        // Check some conditions and if the syntax is wrong...
        if (...)
            createDiagnostic(pos);
    }

    pos++;
}

As you can see, I create a diagnostic at pos. But that pos is wrong if there were some UTF-8 characters before it (because a UTF-8 character can in reality consist of more than one char). How do I correctly skip the UTF-8 chars, counting each one as a single character?

I need this because the diagnostics are sent to UTF-8-aware VSCode.

I tried to read some articles on UTF-8 in C++, but all the material I found is huge. And I only need to skip the UTF-8 characters.


geza · Accepted Answer · 2019-09-22T12:20:37.707


If the code point is less than 128, UTF-8 encodes it as a single byte identical to ASCII (the highest bit is not set). If the code point is 128 or greater, every byte of its encoding has the highest bit set. So, this will work:

unsigned char b = <...>; // b is a byte from a utf-8 string
if (b&0x80) {
    // ignore it, as b is part of a >=128 codepoint
} else {
    // use b as an ASCII code
}

Note: if you want to calculate the number of code points in a UTF-8 string, count the bytes that satisfy either of these conditions:

!(b&0x80): the byte is an ASCII character, or
(b&0xc0)==0xc0: the byte is the first byte of a multi-byte UTF-8 sequence
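
As a rough illustration of the counting rule above, here is a minimal sketch that turns a byte offset (such as the parser's pos) into a code point index. The helper name codePointIndex is hypothetical and not part of the original answer, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original answer): returns how many
// code points start before `byteOffset`, assuming `text` is valid UTF-8.
std::size_t codePointIndex(const std::string& text, std::size_t byteOffset) {
    assert(byteOffset <= text.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < byteOffset; ++i) {
        const unsigned char b = static_cast<unsigned char>(text[i]);
        // Count ASCII bytes and lead bytes of multi-byte sequences;
        // continuation bytes (bit pattern 10xxxxxx) are not counted.
        if (!(b & 0x80) || (b & 0xc0) == 0xc0) {
            ++count;
        }
    }
    return count;
}
```

With something like this, the parser could keep scanning byte by byte and pass codePointIndex(text, pos) to the diagnostic instead of pos. Recounting from the start for every diagnostic is linear in pos; keeping a running count inside the loop would avoid that.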

edited Sep 22 '19 at 12:20

answered Sep 22 '19 at 09:26


Wow, this seems to work! Thank you!!! Could you please explain how this works? I mean, why 128? The real question behind that is: can this approach be *reliably* used for any utf-8 char? Also, is there a special reason to use `0x80` instead of just `128`? – Nurbol Alpysbayev Sep 22 '19 at 11:30
@NurbolAlpysbayev: What do you mean by "why 128"? UTF-8 is compatible with ASCII, it has the same 128 characters mapped. `0x80` is the same as `128`, it is a personal preference. When checking bits, I usually use hexadecimal values instead of decimals. I could have written `(1<<7)` as well, so it is clear, that this value has only one bit set. Or, I could have written `b>=128` too. – geza Sep 22 '19 at 11:52
@NurbolAlpysbayev: and yes, this should be reliable. If `b` is less than 128, it must be a character. If it is `>=128`, then it is surely part of a multi-byte sequence, or it is an invalid byte (which doesn't do any harm in this case, as we'll ignore it). – geza Sep 22 '19 at 11:54
Aahh, so the idea is that ASCII char codes are less than 128, and UTF-8 are more or equal? Wow, it's so simple that I am not sure why other people didn't suggest this. BTW, thank you soooo much!!! You saved my life – Nurbol Alpysbayev Sep 22 '19 at 11:55
@NurbolAlpysbayev: yes. Read the 'Backward compatibility' part on Wikipedia: https://en.wikipedia.org/wiki/UTF-8. UTF-8 was intentionally designed to have this property. "UTF-8 are more or equal": only if UTF-8 actually encodes a non-ASCII character. – geza Sep 22 '19 at 11:58
Great read at Wikipedia and surprisingly simple! God bless Stackoverflow, and the users like you :-) – Nurbol Alpysbayev Sep 22 '19 at 12:02
@NurbolAlpysbayev: I'm happy that helped. Yep, UTF-8 is simple, and very well designed. – geza Sep 22 '19 at 12:11
Just was going to come here and complain that not all UTF-8 sequences have `length() == 2` so I can't rely on the length, when I saw your addition on how to find the first byte of a UTF-8 sequence! Just on point!!! :-) Also, seems to work with UTF-16 as well. I was going to ask about the meaning of `0xc0`, but found that a great explanation already exists here: https://stackoverflow.com/questions/3911536/utf-8-unicode-whats-with-0xc0-and-0x80 – Nurbol Alpysbayev Sep 23 '19 at 02:16
BTW wouldn't `(b & 0xc0) != 0x80` be more correct/solid than `(b&0xc0)==0xc0` if we want to handle single-byte (turns out they exist, and they start with 0, not with 1, see the link in the prev comment) UTF-8 sequences as well? – Nurbol Alpysbayev Sep 23 '19 at 02:25
@NurbolAlpysbayev: with my solution, the condition `!(b&0x80)` handles them. My approach is a little bit different, but the outcome is exactly the same. The semantics of my approach are "count single-byte and head-of-multi-byte sequence bytes", that answer's approach is "count non-continuation-bytes". They are the same. Nice answer, btw., I recommend you use that approach instead of mine, as it is a little bit simpler. These methods surely don't work with UTF-16 correctly, maybe it's just a coincidence that they seem to work. – geza Sep 23 '19 at 07:39
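
To make the "they are the same" claim concrete, here is a small sketch that compares the two tests on all 256 byte values at compile time. The function names are made up for this illustration, and the loop inside a constexpr function assumes C++14 or later.

```cpp
// The answer's test: an ASCII byte, or the lead byte of a multi-byte sequence.
constexpr bool isStartByAnswer(unsigned char b) {
    return !(b & 0x80) || (b & 0xc0) == 0xc0;
}

// The test from the comment: any byte that is not a continuation byte.
constexpr bool isStartByComment(unsigned char b) {
    return (b & 0xc0) != 0x80;
}

// Compares both predicates on every possible byte value.
constexpr bool predicatesAgree() {
    for (int v = 0; v < 256; ++v) {
        const unsigned char b = static_cast<unsigned char>(v);
        if (isStartByAnswer(b) != isStartByComment(b)) {
            return false;
        }
    }
    return true;
}

// Compilation fails here if the two tests ever disagree.
static_assert(predicatesAgree(), "both tests classify every byte identically");
```

The reason they agree: the top two bits of a byte are 00 or 01 (ASCII), 11 (lead byte), or 10 (continuation byte), and both tests reject only the 10 case.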
@NurbolAlpysbayev: and note, that maybe you'll have problems with code points `>=0x10000`. They get encoded in two "characters" in UTF-16 (search for "surrogate pairs"). I'm not sure how JS handles them. – geza Sep 23 '19 at 07:50
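
If the consumer of the positions counts UTF-16 code units, as JavaScript strings do, the surrogate-pair issue could be handled roughly as below. This is only a sketch under that assumption: the helper name utf16Offset is hypothetical, the input is assumed to be valid UTF-8, and whether the editor really expects UTF-16 units should be checked against its documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-16 code units that the first
// `byteOffset` bytes of `text` would occupy, assuming valid UTF-8.
std::size_t utf16Offset(const std::string& text, std::size_t byteOffset) {
    assert(byteOffset <= text.size());
    std::size_t units = 0;
    for (std::size_t i = 0; i < byteOffset; ++i) {
        const unsigned char b = static_cast<unsigned char>(text[i]);
        if ((b & 0xc0) == 0x80) {
            continue;  // continuation byte: already accounted for by its lead byte
        }
        // A lead byte of 0xF0 or above starts a 4-byte sequence, i.e. a code
        // point >= 0x10000, which UTF-16 encodes as a surrogate pair (2 units).
        units += (b >= 0xf0) ? 2 : 1;
    }
    return units;
}
```

For text that contains only code points below 0x10000, this gives the same result as counting code points.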
