What is the encoding of unprefixed string literals in C++? For example, all string literals are parsed and stored as UTF-16 in Java, as UTF-8 in Python3. I guess this is the case with C++ u8"" literals. But I'm not clear about normal literals like "".

What should be the output of the following code?

#include <iostream>
#include <iomanip>

int main() {
    auto c = "Hello, World!";
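    while (*c) {
        // Standard C++ has no compound literals; cast via unsigned char to avoid sign extension.
        std::cout << std::hex << static_cast<unsigned int>(static_cast<unsigned char>(*c++)) << " ";
    }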
}

When I run this on my machine, it gives the following output:

48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 21 

But is this guaranteed? The cppreference page for string literals says that the characters inside ordinary string literals are from the translation character set, and the page on the translation character set states that:

The translation character set consists of the following elements:

  • each character named by ISO/IEC 10646, as identified by its unique UCS scalar value, and
  • a distinct character for each UCS scalar value where no named character is assigned.

From this definition, it seems translation character set refers to Unicode (or its superset). Then is there no difference between "" and u8"" except for explicitness?

Suppose I want my string to be in EBCDIC encoding (just as an exercise). What is the correct way to achieve that in C++?

EDIT: The linked cppreference page for string literals does say that the encoding is implementation-defined. Does that mean I should avoid using them?

Sourav Kannantha B
  • The encoding of unprefixed string literals in C++ is platform dependent. If your platform is Windows, they could be Win1252 assuming your text editor is saving files in Win1252. Maybe your platform is Win1252, MacRoman, ASCII, ISO 8859-1, EBCDIC, GB18030, DS9K, or something else. – Eljay Jan 19 '23 at 12:36
  • From the cppreference page you link: "The encoding of ordinary string literals (1) and wide string literals (2) is implementation-defined." – 463035818_is_not_an_ai Jan 19 '23 at 12:38
  • @Eljay Platform dependent in the sense, is it determined by compiler? or by the operating system? – Sourav Kannantha B Jan 19 '23 at 12:38
  • AFAIK, implementation-defined means it's up to the compiler, but it must be documented. – 463035818_is_not_an_ai Jan 19 '23 at 12:39
  • @463035818_is_not_a_number So, I should not be using them if I am writing a library? or something that is multiplatform – Sourav Kannantha B Jan 19 '23 at 12:40
  • why not? You should not assume that their encoding is the same everywhere. – 463035818_is_not_an_ai Jan 19 '23 at 12:40
  • You can safely expect ASCII. All this talk about it not being cross-platform is mostly a historical curiosity. But for symbols outside ASCII range things get more interesting. – HolyBlackCat Jan 19 '23 at 12:40
  • It's determined by your platform. Let's say your platform is Amiga, then your files are ECMA-94 Latin 1. If you do an encoding that is not your platform's encoding, you'll have to take steps to use a different encoding. For example, let's say you use EBCDIC on an Amiga, then you are responsible for all the effort to make that work. – Eljay Jan 19 '23 at 12:41
  • @HolyBlackCat So, if I am working with something non-ascii, then should I avoid using unprefixed literals? – Sourav Kannantha B Jan 19 '23 at 12:43
  • Another SO answer that might help you [how do I properly use std::string on utf8](https://stackoverflow.com/questions/50403342/how-do-i-properly-use-stdstring-on-utf-8-in-c). – Pepijn Kramer Jan 19 '23 at 12:48
  • What do you mean by *"working with something non-ascii"*? Is your source file actually in a non-ASCII encoding? Note that your source file looks the same in both ASCII and UTF-8, since the former is a subset of the latter, and you don't have any symbols not representable in ASCII – HolyBlackCat Jan 19 '23 at 12:49
  • Have a read of [Character sets and encodings - Code unit and literal encoding](https://en.cppreference.com/w/cpp/language/charset#Code_unit_and_literal_encoding) and if you understand it please explain it to me. – Richard Critten Jan 19 '23 at 12:57
  • It is not guaranteed. See [\[lex.string\]](http://eel.is/c++draft/lex.string#10.1). It's implementation-defined. However, using C++20's string literal prefix `u8""`, it is guaranteed to be UTF-8. (Which happens to be the same as ASCII in this case: [compiler explorer](https://godbolt.org/z/6916z8drY)). – viraltaco_ Jan 19 '23 at 14:14
  • @HolyBlackCat My source file is in UTF-8 encoding. By _"working with non-ascii"_, I meant that I'm reading a file which may contain non-ASCII symbols. I need to do pattern matching on that file. While defining patterns, should I just use `"foo"` or should I use `u8"foo"` every time in the source code? The file is guaranteed to be in UTF-8, btw. – Sourav Kannantha B Jan 19 '23 at 15:18
  • In practice it doesn't matter. Both `"foo"` and `u8"foo"` will be in UTF-8 if the source is in UTF-8, assuming the compiler flags don't override that (MSVC also requires `/utf-8`, it seems, if your source file contains something that doesn't fit in ASCII; otherwise it doesn't matter). – HolyBlackCat Jan 19 '23 at 15:20
  • Source file encoding and execution encoding need not be the same. – n. m. could be an AI Jan 19 '23 at 16:44
  • If you want to know how to work with non-ASCII characters, your question should be basically "how do I work with non-ASCII characters?" – n. m. could be an AI Jan 19 '23 at 16:45
  • @n.m. Yes, I want to know how to work with non-ascii. But I asked this question specifically to clarify myself about string encodings in C++. Anyways, link in Marek's answer tells about using non-ascii in C++. – Sourav Kannantha B Jan 19 '23 at 16:50

1 Answer

The encoding of string literals is controlled by compiler settings, and the default depends on the compiler. AFAIK, MSVC by default uses the encoding defined by the system locale, while GCC and Clang assume UTF-8.

In MSVC you can change this with the /execution-charset: switch. GCC and Clang have the -fexec-charset= switch.

Note that you have to tell the standard library which encoding your string literals use. This is one of the purposes of std::locale::global.

Here is my other answer where I did some experiments with MSVC.

Marek R