My config:

  • Compiler: gnu gcc 4.8.2
  • I compile with C++11
  • platform/OS: Linux 64bit Ubuntu 14.04.1 LTS

I have this method:

static inline std::u16string StringtoU16(const std::string &str) {
    const size_t si = strlen(str.c_str());
    char16_t cstr[si+1];
    memset(cstr, 0, (si+1)*sizeof(char16_t));
    const char* constSTR = str.c_str();
    mbstate_t mbs;
    memset (&mbs, 0, sizeof (mbs));//set shift state to the initial state
    size_t ret = mbrtoc16 (cstr, constSTR, si, &mbs);
    std::u16string wstr(cstr);
    return wstr;
}

I want a conversion from char to char16_t, pretty much (via std::string and std::u16string to facilitate memory management), but regardless of the size of the input variable str, it returns only the first character. If str = "Hello" it returns "H". I am not sure what is wrong with my method. The value of ret is 1.

Kerrek SB
mimosa
  • Look at wstring_convert and codecvt's http://en.cppreference.com/w/cpp/locale/wstring_convert – JDS Sep 15 '14 at 07:49
  • `strlen(str.c_str())`... – T.C. Sep 15 '14 at 07:50
  • And `char16_t cstr[si+1];` - VLA is not valid C++. – T.C. Sep 15 '14 at 07:51
  • @JDS: There's unfortunately no facet that connects the "system" and the "UTF" world. (However, you *can* convert [within the "system" world](http://stackoverflow.com/a/25485477/596781) and within the UTF world using `wstring_convert`, if you're willing to perform some class design acrobatics.) – Kerrek SB Sep 15 '14 at 07:51
  • ...and finally, [`mbrtoc16`](http://en.cppreference.com/w/cpp/string/multibyte/mbrtoc16) converts only a single character. – T.C. Sep 15 '14 at 07:52
  • @Kerrek SB: when your system is UTF8 you can use codecvt_utf8_utf16 – JDS Sep 15 '14 at 07:55
  • @JDS: "When your system is UTF8", a lot of things become a lot easier, that's true. If you want to write general C++, though, you have more hoops to jump through. – Kerrek SB Sep 15 '14 at 08:13
  • To my knowledge, char16_t does not imply any specific character encoding. So, as it seems unimportant to you, which encoding for 16 bit chars you want, simply prepend a 0x00 in front of the 8 bit value. Else, you might want to add to your question which encoding you want. – BitTickler Apr 10 '15 at 00:05
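
For readers following the wstring_convert suggestion in the comments, here is a minimal sketch of what it could look like. It assumes the std::string really holds UTF-8 (it does not consult the system locale), and the names Utf8ToU16 and U16ToUtf8 are made up for illustration. It also needs a standard library that ships <codecvt>, which as far as I know GCC 4.8 does not (libstdc++ gained it in GCC 5), and the facility is deprecated since C++17.

```cpp
#include <codecvt>   // std::codecvt_utf8_utf16 (deprecated since C++17)
#include <locale>    // std::wstring_convert
#include <string>

// Hypothetical helper names. Both assume the narrow string is UTF-8,
// independent of whatever the current locale's encoding is.
inline std::u16string Utf8ToU16(const std::string &utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);  // throws std::range_error on invalid UTF-8
}

inline std::string U16ToUtf8(const std::u16string &utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);   // throws std::range_error on invalid UTF-16
}
```

Unlike mbrtoc16, this converts the whole string in one call, at the cost of hard-coding UTF-8 as the narrow encoding.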

2 Answers

I didn't know mbrtoc16() can only handle one character at a time... what a turtle. Here is the code I ended up with, which works like a charm:

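// requires <uchar.h>, <string.h> and <string>
static inline std::u16string StringtoU16(const std::string &str) {
    std::u16string wstr;
    mbstate_t mbs;
    memset (&mbs, 0, sizeof (mbs));//set shift state to the initial state (once, not per character)
    const char* p = str.data();
    const char* const end = p + str.size();
    while (p < end){
        char16_t c16 = 0;
        // never let mbrtoc16 read past the end of the string
        const size_t rc = mbrtoc16 (&c16, p, end - p, &mbs);
        if (rc == (size_t)-1 || rc == (size_t)-2)
            break;//invalid or truncated multibyte sequence: stop converting
        wstr.push_back(c16);
        if (rc == (size_t)-3)
            continue;//second half of a surrogate pair, no input consumed
        p += (rc == 0) ? 1 : rc;//rc == 0 means an embedded '\0' was converted
    }//while
    return wstr;
}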

And its counterpart (when one direction is needed, sooner or later the other will be too):

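// requires <uchar.h>, <string.h>, <limits.h> and <string>
static inline std::string U16toString(const std::u16string &wstr) {
    std::string str;
    char cstr[MB_LEN_MAX];//large enough for any single multibyte character
    mbstate_t mbs;
    memset (&mbs, 0, sizeof (mbs));//set shift state to the initial state (once, so surrogate pairs work)
    for (const auto& it: wstr){
        const size_t len = c16rtomb (cstr, it, &mbs);
        if (len == (size_t)-1){
            //not representable in the current locale: substitute and reset the state
            str.push_back('?');
            memset (&mbs, 0, sizeof (mbs));
            continue;
        }
        str.append(cstr, len);//cstr is not null-terminated, so pass the length
    }//for
    return str;
}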

Be aware that c16rtomb is lossy when a character cannot be converted from char16_t to char (you might end up printing a bunch of '?' depending on your system), but it will run without complaint.
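
If silent loss is a concern, one option is to report it to the caller. The sketch below relies on c16rtomb returning (size_t)-1 when a code unit has no representation in the current locale's encoding. The name U16toStringChecked and the '?' placeholder are my own choices for illustration, not part of the code above.

```cpp
#include <climits>   // MB_LEN_MAX
#include <cstddef>
#include <string>
#include <uchar.h>   // c16rtomb
#include <wchar.h>   // mbstate_t

// Hypothetical helper: converts `in` to the current locale's multibyte
// encoding and returns false if any code unit could not be converted.
inline bool U16toStringChecked(const std::u16string &in, std::string &out) {
    out.clear();
    mbstate_t state{};            // initial conversion state
    char buf[MB_LEN_MAX];         // room for one multibyte character
    bool lossless = true;
    for (char16_t c : in) {
        const std::size_t rc = c16rtomb(buf, c, &state);
        if (rc == static_cast<std::size_t>(-1)) {
            lossless = false;
            out.push_back('?');   // assumed placeholder for the lost character
            state = mbstate_t{};  // start again from the initial state
            continue;
        }
        out.append(buf, rc);      // rc may be 0 for the first half of a surrogate pair
    }
    return lossless;
}
```

The caller can then decide whether a lossy result is acceptable instead of discovering the '?' characters later.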

mimosa

mbrtoc16() converts a single character, and returns the number of bytes of the multibyte string that were consumed to produce the char16_t.

To convert a whole string, the general approach is:

A) call mbrtoc16().

B) save the converted character, skip the number of bytes that were consumed.

C) Have you consumed the entire string you wanted to convert? If no, go back to step A.

Additionally, there could be conversion errors. You must check the return value from mbrtoc16() and handle conversion errors however you see fit (an error means the original multibyte string is not valid).

Finally, you should not assume that the size of the char16_t string is going to be equal to or less than the size of the multibyte string. It probably is; but, in some weird locale, I suppose it could theoretically be more.
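
Steps A to C, together with the error checks, could be sketched roughly as follows. This is only an illustration under a few assumptions: the function name MultibyteToU16 is made up, errors are reported by throwing std::runtime_error (any other policy would do), and the input is interpreted in whatever encoding the current C locale uses, so a program handling non-ASCII text would need to call setlocale first. The output grows with push_back, so no assumption is made about its final size.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <uchar.h>   // mbrtoc16
#include <wchar.h>   // mbstate_t

// Hypothetical name. Converts from the current locale's multibyte encoding.
inline std::u16string MultibyteToU16(const std::string &in) {
    std::u16string out;
    mbstate_t state{};                       // initial state, kept across calls
    const char *p = in.data();
    const char *const end = p + in.size();
    while (p < end) {                        // step C: anything left to convert?
        char16_t c16 = 0;
        const std::size_t rc =               // step A: convert one character
            mbrtoc16(&c16, p, static_cast<std::size_t>(end - p), &state);
        if (rc == static_cast<std::size_t>(-1))
            throw std::runtime_error("invalid multibyte sequence");
        if (rc == static_cast<std::size_t>(-2))
            throw std::runtime_error("truncated multibyte sequence");
        out.push_back(c16);                  // step B: save the result...
        if (rc == static_cast<std::size_t>(-3))
            continue;                        // second surrogate: nothing consumed
        p += (rc == 0) ? 1 : rc;             // ...and skip the consumed bytes
    }
    return out;
}
```

A return value of 0 means a null character was converted, which is why the loop still advances by one byte in that case.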

Sam Varshavchik