37

From Wikipedia:

For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.

I'm wondering what exactly this means for writing portable applications. Is there any difference between writing this

const char str[] = "Test String";

or this?

const char str[] = u8"Test String";

Is there any reason not to use the latter for every string literal in your code?

What happens when there are non-ASCII characters inside the test string?

Lukas Schmelzeisen
  • http://stackoverflow.com/questions/9739070/char-encoding might be useful – Yakk - Adam Nevraumont Nov 18 '12 at 21:47
  • One of the strings is UTF-8, the other one could be anything, like EBCDIC. – Bo Persson Nov 18 '12 at 21:48
  • Maybe of interest -- some encoding-related questions of mine: [#1](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability), [#2](http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c0x), [#3](http://stackoverflow.com/questions/7562609/what-does-cuchar-provide-and-where-is-it-documented) – Kerrek SB Nov 18 '12 at 21:50

4 Answers

34

The encoding of "Test String" is the implementation-defined system encoding (the narrow, possibly multibyte one).

The encoding of u8"Test String" is always UTF-8.

The examples aren't terribly telling. If you included some Unicode literals (such as \U0010FFFF) into the string, then you would always get those (encoded as UTF-8), but whether they could be expressed in the system-encoded string, and if yes what their value would be, is implementation-defined.
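For instance, something along these lines should hold on a pre-C++20 compiler (where u8 literals still have type const char[]); the byte values are simply the standard UTF-8 encoding of U+10FFFF, while the contents of the unprefixed literal remain implementation-defined:

#include <iostream>

int main() {
    // u8"..." is guaranteed to hold UTF-8: U+10FFFF encodes as the
    // four bytes F4 8F BF BF, plus the terminating '\0'.
    const char u8str[] = u8"\U0010FFFF";
    static_assert(sizeof u8str == 5, "4 UTF-8 code units + NUL");

    // "\U0010FFFF" without the prefix also compiles, but its bytes (and
    // whether the character is representable at all) depend on the
    // implementation-defined execution character set.
    for (unsigned char c : u8str)
        std::cout << std::hex << +c << ' ';   // prints: f4 8f bf bf 0
    std::cout << '\n';
}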

If it helps, imagine you're authoring the source code on an EBCDIC machine. Then the literal "Test String" is always EBCDIC-encoded in the source file itself, but the u8-initialized array contains UTF-8 encoded values, whereas the first array contains EBCDIC-encoded values.

Kerrek SB
18

You quote Wikipedia:

For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be at least the size necessary to store an eight-bit coding of UTF-8.

Well, the “For the purpose of” is not true. char has always been guaranteed to be at least 8 bits, that is, CHAR_BIT has always been required to be ≥8, due to the range required for char in the C standard, which is (to quote C++11 §17.5.1.5/1) “incorporated” into the C++ standard.
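(As a quick illustration, not part of the quote: this has to compile on every conforming implementation.)

#include <climits>
static_assert(CHAR_BIT >= 8, "the C and C++ standards both require char to be at least 8 bits");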

If I should guess about the purpose of that change of wording, it would be to just clarify things for those readers unaware of the dependency on the C standard.

Regarding the effect of the u8 literal prefix, it

  • affects the encoding of the string in the executable, but

  • unfortunately it does not affect the type.

Thus, in both cases "tørrfisk" and u8"tørrfisk" you get a char const[n]. But in the former literal the encoding is whatever is selected for the compiler, e.g. with Latin 1 (or Windows ANSI Western) that would be 8 bytes for the characters plus a null byte, for array size 9. In the latter literal, on the other hand, the encoding is guaranteed to be UTF-8, where the “ø” is encoded with 2 bytes, for array size 10.
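A small sanity check along those lines (pre-C++20, where u8 literals are still plain char, and assuming the compiler reads the “ø” in the source file correctly):

static_assert(sizeof(u8"tørrfisk") == 10,
              "8 characters, with the 'ø' taking 2 UTF-8 bytes, plus the terminating NUL");
// sizeof("tørrfisk") is implementation-defined: 9 with a Latin-1 execution
// character set, but also 10 if the execution character set is itself UTF-8.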

Lightness Races in Orbit
Cheers and hth. - Alf
10

If the execution character set of the compiler is set to UTF-8, it makes no difference if u8 is used or not, since the compiler converts the characters to UTF-8 in both cases.

However, if the compiler's execution character set is the system's non-UTF-8 codepage (the default for e.g. Visual C++), then non-ASCII characters might not be handled properly when u8 is omitted. For example, the conversion to wide strings will crash, e.g. in VS15:

#include <codecvt>   // std::codecvt_utf8_utf16 (deprecated in C++17, but used here)
#include <locale>    // std::wstring_convert
#include <string>
std::string narrowJapanese("スタークラフト"); // bytes are in the execution character set, not UTF-8
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convertWindows;
std::wstring wide = convertWindows.from_bytes(narrowJapanese); // Unhandled C++ exception in xlocbuf.
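Prefixing the literal with u8 should avoid this, since the bytes handed to from_bytes are then guaranteed to be UTF-8 regardless of the execution character set (pre-C++20 sketch, reusing convertWindows from above; on MSVC you may also need the source saved as UTF-8 with a BOM, or the /utf-8 switch, so the source characters are read correctly in the first place):

std::string narrowJapaneseU8(u8"スタークラフト"); // always UTF-8 bytes
std::wstring wideOk = convertWindows.from_bytes(narrowJapaneseU8); // converts without throwing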
Roi Danton
  • So, let's just standardize the "execution character set" to utf-8, and job done. :) – Chef Gladiator Apr 01 '20 at 21:41
  • @ChefGladiator surprisingly enough, the "UTF-8 everywhere" Linux doesn't even have the ECS set to UTF-8 in the bare-bones "C" locale due to some glibc stubbornness... It does mostly work in terms of not mangling existing UTF-8 (because UTF-8 is designed well enough to make byte wise operations work), but absolutely fails when you call encoding conversion to the "ECS". – Mingye Wang Jun 09 '23 at 03:30
8

The compiler chooses a native encoding natural to the platform. On typical POSIX systems it will probably choose ASCII, with the treatment of character values outside the ASCII range possibly depending on the environment's settings. On mainframes it will probably choose EBCDIC. Comparing strings received, e.g., from files or the command line will probably work best with the native character set. When processing files explicitly encoded using UTF-8 you are, however, probably best off using u8"..." strings.

That said, with the recent changes relating to character encodings a fundamental assumption of string processing in C and C++ got broken: each internal character object (char, wchar_t, etc.) used to represent one character. This is clearly not true anymore for a UTF-8 string where each character object just represents a byte of some character. As a result all the string manipulation, character classification, etc. functions won't necessarily work on these strings. We don't have any good library lined up to deal with such strings for inclusion into the standard.
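A small illustration of that point (byte counts assume the UTF-8 literal shown, where 'ü' and 'ß' each take two bytes; pre-C++20, where u8 yields char):

#include <cassert>
#include <string>

int main() {
    std::string s = u8"Grüße";   // 5 characters, but 7 UTF-8 code units
    assert(s.size() == 7);       // size()/strlen() count bytes, not characters
    // s.substr(2, 1) would return only the first byte of the two-byte 'ü' --
    // none of the std::string operations know about character boundaries.
}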

kevinarpe
Dietmar Kühl
  • char has long been known to be possibly multi-byte (i.e., programmers that were assuming one char per character were doing it wrong). On the other hand wchar_t requires fixed width. Unfortunately Unicode fundamentally breaks assumptions about what 'fixed width' means. – bames53 Nov 18 '12 at 22:21
  • I'm not necessarily disagreeing with the fact that strings have been used for quite a while to hold multi-byte encodings, but the standard didn't acknowledge this fact and treated internal characters as single units. All of the standard facilities processing strings still behave as if characters are just one unit! For example, it doesn't really make much sense to have `s.substr(b, n)` if the start and/or the end of the substring can be in the middle of a Unicode character. Not even `wchar_t` strings have fixed-width characters, as there are, e.g., combining characters. – Dietmar Kühl Nov 18 '12 at 22:27
  • @DietmarKühl: "but the standard didn't acknowledge this fact" – I think you mean *in the library functions*. The C++ standard itself has always recognized the existence of multibyte (per character) strings. For example, it recommends/requires (I don't recall exactly which) that `main` arguments are MBCSes, which is where the Windows convention fails -- or, where the standard failed to properly standardize existing practice... ;-) – Cheers and hth. - Alf Nov 18 '12 at 22:30
  • No, the standard acknowledges multibyte encodings, including in the library. For example the code conversion facets can handle illegal sequences, insufficient space to store the multibyte representation of a wide character, etc. Multiple chars-per-wchar_t are acknowledged and handled in many places. The issue with `s.substr(b,n)` isn't an issue with the library, it's an issue with the programmer believing that it operates at the character level rather than, as specified, at the code unit level. – bames53 Nov 18 '12 at 22:38
  • My comment about Unicode fundamentally breaking what 'fixed width' means was about the issue with combining characters, among other things. In light of Unicode wchar_t is pretty much [worthless](http://stackoverflow.com/questions/11107608/whats-wrong-with-c-wchar-t-and-wstrings-what-are-some-alternatives-to-wide). – bames53 Nov 18 '12 at 22:38
  • The standard clearly supports internalizing multi-byte sequences, e.g., in the form of `mbstowcs()` or the `std::codecvt<...>` facets. Once internalized it happily provides functions which *seem* to work on characters, e.g., `isdigit()`, `strlen()` (on my system documented as "find length of string" not "number of bytes in encoding of string"), `std::string` members, etc. The library has no functionality to classify actual characters, find the number of characters in a `std::string`, lexicographically compare strings (I don't think `std::collate` does it), and so on. – Dietmar Kühl Nov 18 '12 at 22:49
  • I'd think we need functionality like this, although it seems that other languages aren't much better (but when I was brought up it wasn't acceptable to compare yourself to equally bad or even worse peers, i.e., that doesn't mean anything to me). The problem seems to be that nobody involved with the C++ standard has the experience and the desire to standardize a suitable library! – Dietmar Kühl Nov 18 '12 at 22:52
  • Yes, there are functions which are hopelessly useless, but that's been true pretty much from the beginning; it's not new with any recent changes in character encoding (unless by 'recent' you mean over at least twenty years ago.) However it's very difficult to specify classifications that would be correct for all applications, or even what 'character' means for all applications. For example the Unicode text segmentation algorithms are only "guidelines" which are not appropriate for all applications. Of course providing such things would still be useful a lot of the time and a huge improvement. – bames53 Nov 19 '12 at 03:37
  • As for strlen, it's not specified to provide the length in terms of user perceived characters either (or inches, or ...). It is actually specified to return "the number of characters" [C11 7.24.6.3/3] and 3.7.1 gives the relevant definition "single-byte character" and the intro to 7.24.1 specifies that characters are treated as having the type `unsigned char`. I admit the specifications could be updated to be clearer though. I particularly dislike C++'s continued muddled use of 'character set' to mean variously 'abstract character repertoire' and 'character encoding scheme' at different times. – bames53 Nov 19 '12 at 03:50
  • wchar_t doesn't guarantee one character either. On Windows wchar_t is 16-bits, and gcc has a short wchar flag so it's only the basic multilingual plane. You should really assume that it *doesn't* represent a single character since this is not true for Chinese. Assume it's UTF-16 and avoid chopping strings up or defining lengths on arbitrary boundaries. C++11 recognizes this which is why explicit 16 and 32-bit chars turned up. – locka Aug 16 '16 at 09:45