wchar_t is used in Windows, which uses the UTF16-LE format. wchar_t requires the wide char functions: for example, wcslen(const wchar_t*) instead of strlen(const char*), and std::wstring instead of std::string.
Unix-based machines (Linux, Mac, etc.) use UTF8. This uses char for storage, and the same C and C++ functions that are used for ASCII, such as strlen(const char*) and std::string. Note that these functions count bytes, not code points (see the comments below about std::find_first_of).
wchar_t is 2 bytes (UTF16) in Windows, but on most other systems it is 4 bytes (UTF32). This makes things more confusing.
For UTF32, you can use std::u32string, which is the same on different systems.
You might consider converting UTF8 to UTF32, because that way each character is always 4 bytes, and you might think string operations will be easier. But that's rarely necessary.
UTF8 is designed so that the ASCII byte values (0 to 127) are never used as part of the encoding of any other Unicode code point. That includes the escape character '\', printf format specifiers, and common parsing characters like ','.
Consider the following UTF8 string. Let's say you want to find the comma:
std::string str = u8"汉,"; // 2 code points represented by 4 bytes (汉 is 3 bytes, the comma is 1)
The ASCII value for comma is 44, and str is guaranteed to contain only one byte whose value is 44. To find the comma, you can simply use any standard function in C or C++ to look for ','.
To find 汉, you can search for the string u8"汉", since this code point cannot be represented as a single char.
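Here is a minimal sketch of both searches. The function names find_comma and find_utf8 are hypothetical, and the bytes of 汉 are written as escapes (E6 B1 89) instead of a u8 literal. That is my own choice, so that the example does not depend on the source file encoding or on C++20, where u8 literals have type char8_t and no longer initialize a std::string directly.

```cpp
#include <cstddef>
#include <string>

const std::string kHan  = "\xE6\xB1\x89";  // 汉, U+6C49, three bytes in UTF8
const std::string kText = kHan + ",";      // four bytes in total

// Byte offset of the first comma, or std::string::npos if there is none.
// Safe on UTF8 because byte 44 can only ever mean ','.
std::size_t find_comma(const std::string& utf8) {
    return utf8.find(',');
}

// Byte offset of the first occurrence of a whole UTF8 encoded substring.
std::size_t find_utf8(const std::string& utf8, const std::string& needle) {
    return utf8.find(needle);
}
```

With kText, find_comma gives byte offset 3 and find_utf8(kText, kHan) gives 0. Both are byte offsets, not code point indexes.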
Some C and C++ functions don't work smoothly with UTF8. These include:
strtok
strspn
std::find_first_of
The argument for the above functions is a set of characters, not an actual string.
So str.find_first_of(u8"汉") does not work, because u8"汉" is 3 bytes and find_first_of will look for any of those bytes. There is a chance that one of those bytes is used to represent a different code point.
On the other hand, str.find_first_of(u8",;abcd") is safe, because all the characters in the search argument are ASCII (str itself can contain any Unicode character).
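To make the problem concrete, here is a small sketch. The helper names (is_ascii_set, find_any_byte, find_whole) are made up for this example. It relies on the fact that É (U+00C9) is encoded as C3 89 and 汉 (U+6C49) as E6 B1 89, so the two share the byte 0x89.

```cpp
#include <cstddef>
#include <string>

// UTF8 bytes written as escapes so the example does not depend on the
// source file encoding or on u8 literals (which are char8_t in C++20).
const std::string kHan    = "\xE6\xB1\x89";  // 汉, U+6C49
const std::string kEAcute = "\xC3\x89";      // É, U+00C9, also ends in byte 0x89

// True if every byte in the set is ASCII, i.e. the set is safe to pass to
// find_first_of, strtok, strspn and similar byte-set functions.
bool is_ascii_set(const std::string& set) {
    for (unsigned char c : set) {
        if (c >= 0x80) return false;
    }
    return true;
}

// Wrong for non-ASCII needles: stops at the first byte that equals ANY byte
// of the needle, even if that byte belongs to a different code point.
std::size_t find_any_byte(const std::string& text, const std::string& needle) {
    return text.find_first_of(needle);
}

// Right: matches the needle as a whole byte sequence.
std::size_t find_whole(const std::string& text, const std::string& needle) {
    return text.find(needle);
}
```

For the text kEAcute + kHan, find_any_byte reports offset 1 (the second byte of É), while find_whole reports offset 2, where 汉 actually starts. A check like is_ascii_set could be used as a guard before calling any of the byte-set functions.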
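In rare cases UTF32 might be required (although I can't imagine where!). You can use std::codecvt to convert UTF8 to UTF32 (keep in mind that the helpers usually used with it, std::wstring_convert and std::codecvt_utf8, are deprecated since C++17) to run operations such as the following: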
std::u32string u32 = U"012汉"; // 4 code points, represented by 4 elements
std::cout << u32.find_first_of(U"汉") << std::endl; // outputs 3
std::cout << u32.find_first_of(U'汉') << std::endl; // outputs 3
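If you would rather not depend on the deprecated conversion helpers, the decoding can be written by hand. The following is only a sketch: utf8_to_utf32 is a hypothetical name, and it checks the basic structure (lead byte, continuation bytes, truncation) but does not reject overlong encodings or surrogate values, which a production decoder should.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF8 byte string into UTF32 code points.
// Assumption: structural validation only, no overlong or surrogate checks.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes that follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For example, decoding the bytes of "012" followed by E6 B1 89 (汉) should give a 4 element std::u32string, on which find_first_of(char32_t(0x6C49)) returns 3, matching the snippet above.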
Side note:
You should use "Unicode everywhere", not "UTF8 everywhere".
In Linux, Mac, etc., use UTF8 for Unicode.
In Windows, use UTF16 for Unicode. Windows programmers use UTF16; they don't make pointless conversions back and forth to UTF8. But there are legitimate cases for using UTF8 in Windows.
Windows programmers tend to use UTF8 for saving files, web pages, etc., so that's less worry for non-Windows programmers in terms of compatibility.
The language itself doesn't care which Unicode format you want to use, but in practice, use the format that matches the system you are working on.