3

I want to find the default encoding of std::string.
I am trying to find out which encoding it uses, but I have no idea. Does std::string in C++ have an encoding?

xuwang
  • `std::string` doesn't really have any "encoding", it's just a collection of `char` elements. It's up to you what you put into it. – Some programmer dude Oct 29 '19 at 11:57
  • Thanks. By the way, how do I convert a std::string to UTF-8? – xuwang Oct 29 '19 at 12:00
  • Are you sure that the string isn't already in UTF-8? You can store ASCII encoded strings in a `std::string`, or EBCDIC encoded strings, or UTF-8 encoded strings, or any encoding scheme that uses bytes. The `std::string` and the data it wraps don't have any specific encoding; it's how *you* treat the data that makes it encoded in one way or another. – Some programmer dude Oct 29 '19 at 12:02
  • I see that std::string's value type is char. So can I convert any char (-128 to 127) to UTF-8? – xuwang Oct 29 '19 at 12:19
  • If you know anything about UTF-8 then you should know that it's partially based on 7-bit ASCII, and 7-bit ASCII has values from `0` to `127` (inclusive) for its characters. Characters not fitting in ASCII will need multiple bytes to encode, but they are still *bytes* and can therefore be encoded and stored in a `std::string` (as long as `char` is a byte, which it is on all PC-like systems). – Some programmer dude Oct 29 '19 at 12:29
  • Does this answer your question? [What encoding does std::string.c\_str() use?](https://stackoverflow.com/questions/1010783/what-encoding-does-stdstring-c-str-use) – phuclv Nov 27 '19 at 13:03

2 Answers

8

The simple answer

std::string is defined as std::basic_string<char>, which means it is a sequence of chars. As such, it can hold the bytes that result from encoding a string as UTF-8.

The following code is valid before C++20:

std::string s = u8"1 שלום Hello";
std::cout << s << std::endl;

And it prints, in a console that supports it:

1 שלום Hello

The u8 prefix before the quoted string marks a UTF-8 string literal: it tells the compiler to encode the string that follows as UTF-8.

Without the u8 prefix, the compiler encodes the literal in its execution character set, so if the default encoding, or the encoding explicitly set for the compiler, supports the characters in the string, you can also write it like this:

std::string s = "1 שלום Hello";
std::cout << s << std::endl;

with the same output as above. However, this is platform and compiler dependent.

If the compiler's execution character set doesn't support these characters, for example if we set it in GCC to Latin-1 with the flag -fexec-charset=ISO-8859-1, the string without the u8 prefix gives the following compilation error:

converting to execution character set:
Invalid or incomplete multibyte or wide character 
    std::string s = "1 שלום Hello";
                     ^~~~~~~~~~~~~~

Since C++20, a u8 string literal can no longer be used to initialize a std::string:

std::string s = u8"1 שלום Hello";
std::cout << s << std::endl;

gives the following compilation error in C++20:

conversion from 'const char8_t [17]' to non-scalar type 'std::string'
{aka 'std::__cxx11::basic_string<char>'} requested
    std::string s = u8"1 שלום Hello";
                    ^~~~~~~~~~~~~~~~~

This is because the type of a u8 string literal in C++20 is not const char[SIZE] but const char8_t[SIZE] (the type char8_t was introduced in C++20).

In C++20 you can, however, use the new type std::u8string:

std::u8string s = u8"1 שלום Hello"; // good - std::u8string added in C++20
// std::cout << s << std::endl; // oops, std::ostream doesn't support u8string

A few interesting notes:

  1. before C++20 a u8 string literal is const char[SIZE]
  2. from C++20 a u8 string literal is const char8_t[SIZE]
  3. the size of char8_t is the same as that of char, but it is a distinct type

The long story

Encoding is a sad story in C++, which is probably why there is no "simple answer" to your question. There still isn't a full-fledged, end-to-end standard solution for handling character encoding. There are standard-library converters, third-party libraries, and so on, but no tight and simple solution. Hopefully C++23 will solve this.

See the CppCon 2019 session on the subject by JeanHeyd Meneide.

Also a related question: how std::u8string will be different from std::string?

Amir Kirsh
0

std::string is a container of char and nothing enforces any particular encoding. Some programmers use it to hold text encoded according to the locale dependent character set while others use it for holding text encoded as UTF-8 or some other encoding. The locale dependent character set is the one associated with the "C" locale by default, but can be changed by a call to std::setlocale. A call to std::setlocale(LC_CTYPE, "") will set the locale character set according to the system defined locale (as indicated by the LANG, LC_ALL, or LC_CTYPE environment variables on POSIX systems, or by the Active Code Page (ACP) on Windows). These locale settings affect the behavior of a few C and C++ interfaces, mainly the character classification functions.

On POSIX systems, you can query the name of the locale-dependent character encoding with a call like nl_langinfo(CODESET). On Windows, you can query the ACP by calling GetACP().

My recommendation is, unless additional information (documentation or other out of band data) indicates a different encoding, to assume that std::string contents are encoded according to locale settings.

Tom Honermann