I would like to understand how regular std::string and std::map operations deal with Unicode code units, should they be present in the string.
Sample code:
    #include <iostream>
    #include "sys/types.h"
    using namespace std;

    int main()
    {
        std::basic_string<u_int16_t> ustr1(std::basic_string<u_int16_t>((u_int16_t*)"ยฤขฃ", 4));
        std::basic_string<u_int16_t> ustr2(std::basic_string<u_int16_t>((u_int16_t*)"abcd", 4));
        for (int i = 0; i < ustr1.length(); i++)
            cout << "Char: " << ustr1[i] << endl;
        for (int i = 0; i < ustr2.length(); i++)
            cout << "Char: " << ustr2[i] << endl;
        if (ustr1 == ustr2)
            cout << "Strings are equal" << endl;
        cout << "string length: " << ustr1.length() << "\t" << ustr2.length() << endl;
        return 0;
    }
The strings contain Thai and ASCII characters, and the intent behind using basic_string<u_int16_t> is to facilitate storage of characters that cannot be accommodated within a single byte. The code was run on a Linux box whose locale is en_US.UTF-8. The output is:
$ ./a.out
Char: 47328
Char: 57506
Char: 42168
Char: 47328
Char: 25185
Char: 25699
Char: 17152
Char: 24936
string length: 4 4
A few questions:
Do the character values in the output correspond to en_US.UTF-8 code points? If not, what are they?
Would the std::string operators like ==, !=, < etc. be able to work with Unicode code points? If so, would it be a mere comparison of the code points in corresponding positions? Would std::map work along similar lines?
Would changing the locale to UTF-16 result in the strings getting stored as UTF-16 code points?
Thanks!