2

Is there any difference between the following?

auto s1 = L"你好";
auto s2 = u8"你好";

Do s1 and s2 have the same type?
If not, what is the difference, and which one is preferred?

MBZ

3 Answers

4

L"" creates a null-terminated wide string literal of type const wchar_t[N], where N includes the terminator. This is valid in C++03. (Note that wchar_t is an implementation-defined "wide-character" type: typically 16 bits on Windows and 32 bits on Linux and macOS.)

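u8"" creates a null-terminated UTF-8 string literal of type const char[N]. This is valid only in C++11 and later. (In C++20 the element type changed to char8_t, so the literal has type const char8_t[N] there.)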

Which one you choose depends strongly on your needs. L"" works in C++03, so if you need to work with older code (which may need to be compiled with a C++03 compiler), you'll need to use that. u8"" is easier to work with in many circumstances, particularly when the system in question normally expects char * strings.

nneonneo
4

They are not the same type.

s2 is initialized from a UTF-8 string literal, which is one kind of narrow string literal. The C++11 draft standard section 2.14.5 String literals paragraph 7 says:

A string literal that begins with u8, such as u8"asdf", is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.

And paragraph 8 says:

Ordinary string literals and UTF-8 string literals are also referred to as narrow string literals. A narrow string literal has type “array of n const char”, where n is the size of the string as defined below, and has static storage duration (3.7).

s1 is initialized from a wide string literal. Its encoding is implementation-defined; in practice it is usually UTF-16 or UTF-32. Section 2.14.5 String literals paragraph 11 says:

A string literal that begins with L, such as L"asdf", is a wide string literal. A wide string literal has type “array of n const wchar_t”, where n is the size of the string as defined below; it has static storage duration and is initialized with the given characters.

See UTF8, UTF16, and UTF32 for a good discussion on the differences and advantages of each.

A quick way to determine types is to use typeid:

std::cout << typeid(s1).name() << std::endl;
std::cout << typeid(s2).name() << std::endl;

On my system this is the output:

PKw
PKc

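Checking each of these with c++filt -t gives me the following. The results are pointers rather than arrays because auto deduces the decayed type of the literal: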

wchar_t const*
char const*
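Since the output of typeid(...).name() is implementation-defined, the same check can also be done at compile time. The following is only a sketch: the alias utf8_unit is a name I made up for illustration, and it assumes the compiler defines the __cpp_char8_t feature-test macro when u8 literals are char8_t-based (as in C++20). The characters are written as \u escapes so that the source file encoding does not matter.

```cpp
#include <type_traits>

// u8 literals use char in C++11/14/17 and char8_t from C++20 onward.
// utf8_unit is a hypothetical alias that papers over that difference.
#if defined(__cpp_char8_t)
using utf8_unit = char8_t;
#else
using utf8_unit = char;
#endif

// The literals themselves are lvalue arrays (2 characters + terminator here),
// so decltype yields a reference to the array type.
static_assert(std::is_same<decltype(L"hi"), const wchar_t (&)[3]>::value,
              "wide literal is an array of const wchar_t");
static_assert(std::is_same<decltype(u8"hi"), const utf8_unit (&)[3]>::value,
              "UTF-8 literal is an array of const char (char8_t in C++20)");

// With auto, the array decays and a pointer type is deduced.
auto s1 = L"\u4f60\u597d";   // U+4F60 U+597D
auto s2 = u8"\u4f60\u597d";

static_assert(std::is_same<decltype(s1), const wchar_t*>::value,
              "s1 is a pointer to const wchar_t");
static_assert(std::is_same<decltype(s2), const utf8_unit*>::value,
              "s2 is a pointer to const char (char8_t in C++20)");
```

If the file compiles, every assertion holds; no program needs to be run and no name demangling is involved.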
Shafik Yaghmour
  • 1
    @r-martinho-fernandes My original markup was pretty bad and I appreciate the edit it looks much better. I was way too tired when I posted, but the comment with the edit was uncalled for and if it was anyone else I would have flagged it. – Shafik Yaghmour Aug 20 '13 at 09:54
3

The first is a wide character string, which might be encoded as UTF-16 or UTF-32, or something else entirely (though Unicode is now common enough that a completely different encoding is pretty unlikely).

The second is a string of narrow characters using UTF-8 encoding.

As to which is preferred: it'll depend on what you're doing, what platform you're coding for, etc. If you're mostly dealing with something like a web page/URL that's already encoded as UTF-8, and you'll probably just read it in, possibly verify its content, and later echo it back, it may well make sense to store it as UTF-8 as well.

Wide character strings vary by platform. If, for example, you're coding for Windows, and a lot of the code interacts directly with the OS (which uses UTF-16), then storing your strings as UTF-16 can make a great deal of sense (and that's what Microsoft's compiler uses for wide character strings).
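To make the storage difference concrete, here is a rough sketch that counts the code units in each literal at compile time. The helper element_count and the constant names are made up for illustration. It assumes wchar_t is at least 16 bits wide and that the wide encoding is UTF-16 or UTF-32, which holds on the common desktop platforms but is not guaranteed by the standard.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in an array, terminator included.
template <typename T, std::size_t N>
constexpr std::size_t element_count(const T (&)[N]) { return N; }

// U+4F60 U+597D, written as escapes so the source encoding does not matter.
constexpr std::size_t utf8_units = element_count(u8"\u4f60\u597d") - 1;
constexpr std::size_t wide_units = element_count(L"\u4f60\u597d") - 1;

// Each of these two characters needs three code units in UTF-8.
static_assert(utf8_units == 6, "two characters, three UTF-8 code units each");

// Both characters are in the Basic Multilingual Plane, so each fits in a
// single wchar_t under either UTF-16 or UTF-32 (assumption stated above).
static_assert(wide_units == 2, "one wchar_t per character");

// The byte count of the wide form is what varies by platform:
// 4 bytes with a 16-bit wchar_t, 8 bytes with a 32-bit wchar_t.
constexpr std::size_t wide_bytes = wide_units * sizeof(wchar_t);
```

So for this particular string the UTF-8 form takes 6 bytes everywhere, while the wide form takes 4 or 8 bytes depending on sizeof(wchar_t).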

Jerry Coffin