Is there any difference between the following?
auto s1 = L"你好";
auto s2 = u8"你好";
Are s1 and s2 of the same type?
If not, what's the difference, and which one is preferred?
L"" creates a null-terminated string, of type const wchar_t[]. This is valid in C++03. (Note that wchar_t refers to an implementation-dependent "wide-character" type.)
u8"" creates a null-terminated UTF-8 string, of type const char[]. This is valid only in C++11 and later. (Since C++20 the type is const char8_t[] instead.)
Which one you choose depends strongly on your needs. L"" works in C++03, so if you need to work with older code (which may need to be compiled with a C++03 compiler), you'll need to use that. u8"" is easier to work with in many circumstances, particularly when the system in question normally expects char * strings.
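To make the difference concrete, here is a minimal sketch that counts the code units in each literal at compile time. The helper code_units is a hypothetical name, not part of the standard library, and the text is spelled with universal character names (\u4F60\u597D is 你好) so the result does not depend on how the source file is saved. The wide count assumes the implementation encodes wchar_t as UTF-16 or UTF-32, which is typical but not guaranteed.

```cpp
#include <cstddef>

// Hypothetical helper: counts code units before the terminating null.
// Written as a single return statement so it is valid C++11 constexpr.
template <typename CharT>
constexpr std::size_t code_units(const CharT* s) {
    return *s ? 1 + code_units(s + 1) : 0;
}

// UTF-8 needs three bytes for each of these two characters.
static_assert(code_units(u8"\u4F60\u597D") == 6,
              "expected 6 UTF-8 code units");

// Assumption: wchar_t is UTF-16 or UTF-32, where each of these
// characters occupies exactly one code unit.
static_assert(code_units(L"\u4F60\u597D") == 2,
              "expected 2 wide code units");

// A UTF-8 code unit is always one byte wide.
static_assert(sizeof(u8"\u4F60\u597D"[0]) == 1,
              "UTF-8 code units are one byte");
```

Because the helper is a template, it accepts both literal kinds, including the char8_t flavour of u8"" in C++20. The unit counts differ even though both literals hold the same two characters.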
They are not the same type.
s2 is a UTF-8 string literal, which is one kind of narrow string literal. The C++11 draft standard, section 2.14.5 String literals, paragraph 7 says:
A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.
And paragraph 8 says:
Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration (3.7).
s1 is a wide string literal, which can hold UTF-16 or UTF-32 depending on the implementation. Section 2.14.5 String literals, paragraph 11 says:
A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.
See UTF8, UTF16, and UTF32 for a good discussion on the differences and advantages of each.
A quick way to determine types is to use typeid:
std::cout << typeid(s1).name() << std::endl ;
std::cout << typeid(s2).name() << std::endl ;
On my system this is the output:
PKw
PKc
Checking each of these with c++filt -t gives me:
wchar_t const*
char const*
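The same check can be done at compile time, which avoids mangled names altogether. This is a sketch using std::is_same from <type_traits>; it assumes s1 and s2 are declared with auto as in the question, so the literals' array types decay to pointers. The __cpp_char8_t branch covers C++20, where u8 literals use char8_t rather than char.

```cpp
#include <type_traits>

// auto deduces a pointer here because the literal's array type decays.
auto s1 = L"\u4F60\u597D";   // \u4F60\u597D spells 你好
auto s2 = u8"\u4F60\u597D";

static_assert(std::is_same<decltype(s1), const wchar_t*>::value,
              "s1 is a pointer to const wchar_t");

#if defined(__cpp_char8_t)
// C++20 and later: UTF-8 literals have element type char8_t.
static_assert(std::is_same<decltype(s2), const char8_t*>::value,
              "s2 is a pointer to const char8_t");
#else
// C++11 through C++17: UTF-8 literals have element type char.
static_assert(std::is_same<decltype(s2), const char*>::value,
              "s2 is a pointer to const char");
#endif

// Either way, the two variables never share a type.
static_assert(!std::is_same<decltype(s1), decltype(s2)>::value,
              "s1 and s2 have different types");
```

If any of these assumptions is wrong for a given compiler, the build fails with the message from the corresponding static_assert, so nothing has to be run or demangled.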
The first is a wide character string, which might be encoded as UTF-16 or UTF-32, or something else entirely (though Unicode is now common enough that a completely different encoding is pretty unlikely).
The second is a string of narrow characters using UTF-8 encoding.
As to which is preferred: it'll depend on what you're doing, what platform you're coding for, etc. If you're mostly dealing with something like a web page/URL that's already encoded as UTF-8, and you'll probably just read it in, possibly verify its content, and later echo it back, it may well make sense to store it as UTF-8 as well.
Wide character strings vary by platform. If, for example, you're coding for Windows and much of the code interacts directly with the OS (which uses UTF-16), then storing your strings as UTF-16 can make a great deal of sense (and that's what Microsoft's compiler uses for wide character strings).
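The platform variation shows up as soon as a character falls outside the Basic Multilingual Plane. The sketch below counts the wchar_t units a compiler emits for U+1F600. The constant name emoji_units is made up for illustration, and the expected values assume the usual encodings: UTF-16 when wchar_t is 2 bytes, UTF-32 when it is wider.

```cpp
#include <cstddef>

// Number of wchar_t code units used for one non-BMP character (U+1F600),
// excluding the terminating null.
constexpr std::size_t emoji_units =
    sizeof(L"\U0001F600") / sizeof(wchar_t) - 1;

// Assumption: a 2-byte wchar_t means UTF-16 (a surrogate pair, so 2 units);
// anything wider means UTF-32 (1 unit).
static_assert(emoji_units == (sizeof(wchar_t) == 2 ? 2u : 1u),
              "unexpected wide encoding");

// The UTF-8 form is four bytes regardless of platform.
static_assert(sizeof(u8"\U0001F600") - 1 == 4,
              "U+1F600 is 4 UTF-8 bytes");
```

Code that indexes or measures wide strings therefore behaves differently across platforms, while the UTF-8 byte sequence stays the same everywhere.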