
I would like to understand how regular std::string and std::map operations deal with Unicode code units should they be present in the string.

Sample code:

    #include <iostream>
    #include <string>
    #include <sys/types.h>

    using namespace std;

    int main()
    {

        std::basic_string<u_int16_t> ustr1(std::basic_string<u_int16_t>((u_int16_t*)"ยฤขฃ", 4));
        std::basic_string<u_int16_t> ustr2(std::basic_string<u_int16_t>((u_int16_t*)"abcd", 4));

        for (int i = 0; i < ustr1.length(); i++)
            cout << "Char: " << ustr1[i] << endl;

        for (int i = 0; i < ustr2.length(); i++)
            cout << "Char: " << ustr2[i] << endl;

        if (ustr1 == ustr2)
            cout << "Strings are equal" << endl;

        cout << "string length: " << ustr1.length() << "\t" << ustr2.length() << endl;
        return 0;
    }

The strings contain Thai characters and ASCII characters, and the intent behind using basic_string<u_int16_t> is to store characters that do not fit in a single byte. The code was run on a Linux box whose locale is en_US.UTF-8. The output is:

    $ ./a.out
    Char: 47328
    Char: 57506
    Char: 42168
    Char: 47328
    Char: 25185
    Char: 25699
    Char: 17152
    Char: 24936
    string length: 4        4

A few questions:

  1. Do the character values in the output correspond to en_US.UTF-8 code points? If not, what are they?

  2. Would the std::string operators like ==, !=, < etc. be able to work with Unicode code points? If so, would it be a mere comparison of the code points at corresponding positions? Would std::map work along similar lines?

  3. Would changing the locale to UTF-16 result in the strings getting stored as UTF-16 code points?

Thanks!

Maddy

1 Answer


> I would like to understand how regular std::string and std::map operations deal with Unicode code units should they be present in the string.

They don't.

std::string is a sequence of chars or bytes. It is not a "high-level" string taking any encoding into account. You must do that yourself, e.g. by using a library dedicated to that purpose such as ICU.
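A minimal sketch of what that means in practice, assuming the strings hold UTF-8. The helper names `precomposed`, `decomposed` and `count_code_points` are made up for this illustration; they are not part of any library.

```cpp
#include <cstddef>
#include <string>

// "é" as the single precomposed code point U+00E9, encoded in UTF-8 (2 bytes).
inline std::string precomposed() { return "\xC3\xA9"; }

// "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT, encoded in UTF-8 (3 bytes).
inline std::string decomposed() { return "e\xCC\x81"; }

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes well-formed UTF-8; std::string itself offers nothing like this.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80) ++n;
    return n;
}
```

Here `precomposed().size()` is 2 and `decomposed().size()` is 3, because `size()` counts `char`s, and `precomposed() == decomposed()` is false even though both render as é. Making them compare equal takes Unicode normalisation, which is the kind of job a library such as ICU handles.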

Switching from std::string (i.e. std::basic_string<char>) to std::basic_string<u_int16_t> doesn't change that; it just means you have a sequence of "wide" characters instead.
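As far as I can tell, the numbers printed in the question are neither code points nor UTF-16: the cast to `u_int16_t*` only reinterprets the UTF-8 bytes of the literal two at a time. The sketch below reproduces that pairing explicitly, assuming a little-endian machine and a UTF-8 source file; `pair_bytes_le` and `thai_utf8` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pairs consecutive bytes into 16-bit values, low byte first. This mimics
// reading a char buffer through a u_int16_t* on a little-endian machine
// (an assumption about the platform in the question).
inline std::vector<std::uint16_t> pair_bytes_le(const std::string& bytes, std::size_t units) {
    assert(bytes.size() >= 2 * units && "would read past the end of the buffer");
    std::vector<std::uint16_t> out;
    for (std::size_t i = 0; i < units; ++i) {
        const auto lo = static_cast<unsigned char>(bytes[2 * i]);
        const auto hi = static_cast<unsigned char>(bytes[2 * i + 1]);
        out.push_back(static_cast<std::uint16_t>(lo | (hi << 8)));
    }
    return out;
}

// UTF-8 bytes of the first three Thai letters in the question:
// U+0E22, U+0E24, U+0E02 -> E0 B8 A2, E0 B8 A4, E0 B8 82.
inline std::string thai_utf8() {
    return "\xE0\xB8\xA2\xE0\xB8\xA4\xE0\xB8\x82";
}
```

`pair_bytes_le(thai_utf8(), 4)` gives 47328, 57506, 42168, 47328, matching the first four lines of output, and `pair_bytes_le("abcd", 2)` gives 25185, 25699. The question's code asks for four units from `"abcd"`, which is only five bytes including the terminator, so the last two values (17152, 24936) come from reading past the end of the literal; that is undefined behaviour, and the bytes look like they belong to the neighbouring `"Char: "` literal.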

And std::map has nothing to do with this at all.
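To illustrate: by default std::map orders its keys with `std::less<Key>`, which for strings is just `operator<` on the stored code units. A small sketch, assuming C++11 `std::u16string` keys holding UTF-16; the function name `keys_in_map_order` is invented for this example.

```cpp
#include <map>
#include <string>
#include <vector>

// Inserts two single-character UTF-16 keys into a std::map and returns them
// in the map's iteration order. The map compares keys with operator<, which
// for basic_string is a lexicographic comparison of code units.
inline std::vector<std::u16string> keys_in_map_order() {
    const std::u16string fullwidth_tilde = u"\uFF5E";      // one code unit: 0xFF5E
    const std::u16string grinning_face   = u"\U0001F600";  // surrogate pair: 0xD83D 0xDE00

    std::map<std::u16string, int> m;
    m[fullwidth_tilde] = 1;
    m[grinning_face] = 2;

    std::vector<std::u16string> keys;
    for (const auto& entry : m) keys.push_back(entry.first);
    return keys;
}
```

The key for U+1F600 comes out first, because its leading code unit 0xD83D is smaller than 0xFF5E, even though U+1F600 is the larger code point. The map is perfectly consistent; it simply sorts by code unit, not by code point or by any locale's collation rules.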

Lightness Races in Orbit
  • Thanks for the clarification. If UTF-16 encoded strings (containing non-ASCII characters) are stored in a `std::basic_string` type, how do string operations like `==`, `!=`, `<` etc. fare? I'm assuming they'd fail on Linux, whose encoding type is something else. But if the encoding type of the string and of the platform is the same, what would happen? If the `==` operation doesn't work in this case, I'm curious to know why. – Maddy Apr 21 '16 at 04:49
  • I meant, how do the string operations *on* these strings fare? – Maddy Apr 21 '16 at 05:02
  • @Maddy: It's not clear what you're asking. What do you mean how do they "fare"? They fare excellently. They perform exactly the operation they are designed and specified to perform; that is, operations on a sequence of `char`/`u_int16_t`, with no regard for encoding whatsoever. Which is what I said in my answer. But I don't understand why you think `==` would ever fail to perform its job of checking for equality? – Lightness Races in Orbit Apr 21 '16 at 10:22
  • Consider the case where `std::basic_string` stores a UTF-16 encoded string. What is essentially stored in memory are the code units corresponding to the characters, correct? If so, for the `==` operation, the values (code units, in this case) in the corresponding memory locations are pulled up for comparison, and are deemed equal or unequal. Is the understanding correct so far? – Maddy Apr 22 '16 at 04:22
  • For instance, the UTF-16LE code units for `𤭢` (U+24B62) are `52 D8 62 DF` (a surrogate pair). – Maddy Apr 22 '16 at 04:32
  • @Maddy: Yes? I'm not understanding what else it could possibly mean. `==` performs an equality comparison. It checks whether things are equal. – Lightness Races in Orbit Apr 22 '16 at 08:52
  • Thanks for the clarification! The discussion, though elementary, was useful :) – Maddy Apr 22 '16 at 14:40