I'm trying to implement C++ code where we could use a non-UTF-8 char as a delimiter inside a std::string.

Is there such a thing as a non-UTF-8 char?

user643605

3 Answers

Yes. 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, and 0xFF are invalid UTF-8 code units. A UTF-8 code unit is 8 bits, so if by char you mean an 8-bit byte, then the invalid UTF-8 code units are char values that never appear in UTF-8 encoded text.
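
As a quick illustration (not part of the original answer; the field contents and the name `kDelim` are just examples), one of those bytes, 0xFF, can serve as a delimiter that can never collide with valid UTF-8 text:

```cpp
// A minimal sketch: using 0xFF, a byte that can never occur in UTF-8,
// as a field delimiter inside a std::string.
#include <iostream>
#include <string>
#include <vector>

int main() {
    const char kDelim = '\xFF'; // invalid in UTF-8, so it cannot collide with text

    // Two UTF-8 fields ("é" is the two bytes 0xC3 0xA9) joined by the delimiter.
    std::string record = std::string("caf\xC3\xA9") + kDelim + "bar";

    // Split the record back into its fields.
    std::vector<std::string> fields;
    std::size_t start = 0, pos;
    while ((pos = record.find(kDelim, start)) != std::string::npos) {
        fields.push_back(record.substr(start, pos - start));
        start = pos + 1;
    }
    fields.push_back(record.substr(start));

    for (const auto& f : fields)
        std::cout << f << '\n'; // prints "café" then "bar" on a UTF-8 terminal
}
```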

Tom Blodget

std::string only knows about raw char values; it knows nothing about any particular character encoding that uses char to hold encoded values.

Many common UTF-8 implementations use char to hold encoded code units (though C++20 will introduce char8_t and std::u8string for this purpose instead). Other character encodings (Windows-125x, ISO-8859-x, etc.) can also fit their encoded values in char elements.

Any char value that falls within the ASCII range (0x00 .. 0x7F) fits in 1 char and maps to the same codepoint value in Unicode (U+0000 .. U+007F), but any char value outside the ASCII range (0x80 .. 0xFF) is subject to interpretation by whatever character encoding created it. Some encodings use 1 char per character, some use multiple chars.

So yes, there is such a thing as a "non-UTF-8 char".
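
To make the encoding point concrete, here is a small sketch (the byte values are the standard ISO-8859-1 and UTF-8 encodings of 'é'):

```cpp
// std::string stores bytes, not characters. The byte 0xE9 is 'é' in
// ISO-8859-1 (Latin-1), while UTF-8 encodes the same character as the
// two bytes 0xC3 0xA9.
#include <iostream>
#include <string>

int main() {
    std::string latin1 = "\xE9";     // 'é' as ISO-8859-1: one char
    std::string utf8   = "\xC3\xA9"; // 'é' as UTF-8: two chars

    // std::string just counts char elements; the encoding is up to you.
    std::cout << latin1.size() << '\n'; // 1
    std::cout << utf8.size()   << '\n'; // 2
}
```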

Remy Lebeau
  • But the C++ Standard still requires char to have a size of exactly 1 byte. Assuming the usual 8 bits = 1 byte, any UTF-8 code unit will always fit into `char` – Anonymous Anonymous Oct 02 '19 at 22:56
  • A UTF-8 *encoded code unit* can be made to fit in a `char`, yes. But UTF-8 is an 8-bit encoding, and `char` may be either *signed* or *unsigned* depending on the compiler implementation. In the *signed* case, the code units of any Unicode codepoint above U+007F will occupy the sign bit of each `char`. Also note that although `char` is guaranteed to be 1 byte in size, a byte is not guaranteed to be 8 bits on all platforms (though on most, it is) - see `CHAR_BIT` in `limits.h`. UTF-7, on the other hand, would fit nicely in a `char` string without using the sign bit at all. – Remy Lebeau Oct 02 '19 at 23:10
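
A tiny sketch of the signedness pitfall described in the comment above (assuming a common platform where `CHAR_BIT` is 8 and `char` is signed):

```cpp
#include <iostream>

int main() {
    char c = '\xFF'; // typically -1 if char is signed, 255 if unsigned

    // Naive test: fails where char is signed, because c promotes to
    // the int -1, and -1 >= 0x80 is false.
    std::cout << (c >= 0x80) << '\n'; // 0 with signed char

    // Portable test: go through unsigned char before comparing byte values.
    std::cout << (static_cast<unsigned char>(c) >= 0x80) << '\n'; // always 1
}
```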

You can check out the UTF-8 standard on Wikipedia. Not every sequence of bytes is valid UTF-8, even a single byte: for example, 0b11111000 (0xF8) and 0b11111111 (0xFF) are not valid first bytes in UTF-8.
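
As a rough sketch (the function name is made up for illustration), a lead-byte check based on the ranges the UTF-8 standard allows:

```cpp
// Valid UTF-8 lead bytes are 0x00-0x7F (ASCII), 0xC2-0xDF (2-byte
// sequences), 0xE0-0xEF (3-byte) and 0xF0-0xF4 (4-byte); everything
// else, including 0xF8 and 0xFF, can never start a sequence.
#include <iostream>

bool is_valid_utf8_lead(unsigned char b) {
    return b <= 0x7F || (b >= 0xC2 && b <= 0xF4);
}

int main() {
    std::cout << is_valid_utf8_lead(0x41) << '\n'; // 1: 'A'
    std::cout << is_valid_utf8_lead(0xF8) << '\n'; // 0: 0b11111000
    std::cout << is_valid_utf8_lead(0xFF) << '\n'; // 0: 0b11111111
}
```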

That said, I doubt it is a good idea to use a non-UTF-8 character as a delimiter. You might find certain programs (like Notepad++) having issues reading the output of your strings.

ALX23z