1

I've been searching for hours today and just can't find anything that works for me. The one I've just had a look at, with no luck, is "How to convert UTF-8 encoded std::string to UTF-16 std::string".

My question is, with a brief explanation:

I want to make a valid NTLM hash in standard C++, and I'm using OpenSSL's library to create the hash with its MD4 routines. I know how to do that part, so does anyone know how to convert a std::string into a UTF-16LE encoded string which I can pass to the MD4 functions to get a correct digest?

So, can I take a std::string, which holds char elements, and convert it to a UTF-16LE encoded, variable-length string type, whether that be std::u16string or std::wstring?

And would I use s.c_str() or s.data(), and would the length() function report the size correctly in both cases?

  • Your title question is clear, your question body is not. Are you aware that UTF-16 is still variable-length? That you would hold a UTF-16 string in a `std::u16string`, not a `std::string`? -- Could you please *focus down* the question? It's a bit all over the place right now. – DevSolar Oct 08 '18 at 13:44
  • Thank you DevSolar. You are right. It's late at night and I'm a bit frustrated, so that came out a bit of a mess. I am aware that UTF16 is variable length, so I'm looking for std::string to std::u16string or std::wstring (if that works). I think the better question is perhaps: can I have a std::string which holds the char type, and convert it to a UTF16-LE encoded variable length std::string_type? Whether that be std::u16string, or std::wstring. – JYG Oct 08 '18 at 13:50
  • About the last question, `length()` will always correctly return the number of char-type elements in the string object -- `char` for `std::string`, `char16_t` for `std::u16string`, `wchar_t` for `std::wstring`. None of those (necessarily) equals the number of code units / code points, of course. ;-) (A small sketch illustrating this follows these comments.) – DevSolar Oct 08 '18 at 14:51
  • You have to go through these steps: UTF-8 -> Unicode code points -> UTF-16. There is no way to go from 8 to 16 without knowing the code point. – Sandburg Nov 18 '19 at 11:28
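
To illustrate the point about `length()` above: a minimal sketch (not part of the original thread) showing that `length()` counts elements of the string's character type, not Unicode code points.

#include <iostream>
#include <string>

int main()
{
    // "€" (U+20AC) is three bytes in UTF-8 but a single UTF-16 code unit.
    std::string    u8  = "\xE2\x82\xAC"; // UTF-8 bytes of U+20AC
    std::u16string u16 = u"\u20AC";

    std::cout << u8.length()  << '\n'; // 3 (char elements)
    std::cout << u16.length() << '\n'; // 1 (char16_t element)

    // A code point outside the BMP takes a surrogate pair in UTF-16:
    std::u16string clef = u"\U0001D11E"; // MUSICAL SYMBOL G CLEF
    std::cout << clef.length() << '\n'; // 2 (one surrogate pair)
}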

2 Answers

1

I think something like this should do the trick:

#include <codecvt>   // std::codecvt_utf8_utf16 (deprecated in C++17, still available)
#include <locale>    // std::wstring_convert
#include <stdexcept> // std::runtime_error
#include <string>

// UTF-16 -> UTF-8
std::string utf16_to_utf8(std::u16string const& s)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t, 0x10ffff,
        std::codecvt_mode::little_endian>, char16_t> cnv;
    std::string utf8 = cnv.to_bytes(s);
    if(cnv.converted() < s.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

// UTF-8 -> UTF-16
std::u16string utf8_to_utf16(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t, 0x10ffff,
        std::codecvt_mode::little_endian>, char16_t> cnv;
    std::u16string s = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

Note that std::wstring_convert is deprecated in C++17, but I still favour using it rather than a non-standard library, given that it is portable, has no dependencies, and will no doubt remain available until it is replaced.

And, if all else fails, you can reimplement these same functions with alternative code without changing any other part of the application.
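
As a rough illustration (not part of the original answer) of how this conversion could feed the NTLM hash from the question: on a little-endian host the `char16_t` buffer returned by `utf8_to_utf16` is already laid out as UTF-16LE in memory, so its bytes can go straight to OpenSSL's MD4 routines (the same `MD4_Init`/`MD4_Update`/`MD4_Final` calls shown in the comments further down; note that OpenSSL 3.x treats MD4 as a legacy algorithm). The `ntlm_hash` helper name is made up for the example.

#include <openssl/md4.h>

// Sketch: the NTLM hash is MD4 over the UTF-16LE encoding of the password.
// Assumes a little-endian host, so each char16_t is stored low byte first.
void ntlm_hash(std::string const& password, unsigned char digest[MD4_DIGEST_LENGTH])
{
    std::u16string wide = utf8_to_utf16(password);

    MD4_CTX ctx;
    MD4_Init(&ctx);
    MD4_Update(&ctx, wide.data(), wide.size() * sizeof(char16_t)); // length in bytes
    MD4_Final(digest, &ctx);
}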

Galik
  • Hi Galik, thanks very much for taking the time to type this out. I tried it for hours, I googled, I went nuts... it didn't work, even though everything was telling me this looked ideal and that we were on the right track here. To be honest, though, I don't completely understand the C++ documentation for codecvt or any of the conversions. I'm more of a C programmer who likes to use C++ features whenever possible. I agree about favouring it over a non-standard library. It should be possible. – JYG Oct 09 '18 at 02:22
  • @JYG On my system this produces `UTF-16LE` encoding from `UTF-8`. I am running on an `x86` CPU, which is *little-endian*. Are you running on a *big-endian* system? – Galik Oct 09 '18 at 06:11
  • @JYG I changed the code to explicitly specify `UTF-16LE`, does that fix the issue? (A byte-order-independent serialization sketch follows these comments.) – Galik Oct 09 '18 at 07:21
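
On the endianness question in these comments: a small sketch (not from the thread) that serializes the `char16_t` code units to explicit little-endian bytes, so the buffer is UTF-16LE regardless of the host CPU's byte order.

#include <cstdint>
#include <string>
#include <vector>

// Emit each UTF-16 code unit low byte first, independent of host byte order.
std::vector<std::uint8_t> to_utf16le_bytes(std::u16string const& s)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(s.size() * 2);
    for(char16_t cu : s)
    {
        bytes.push_back(static_cast<std::uint8_t>(cu & 0xFF));        // low byte
        bytes.push_back(static_cast<std::uint8_t>((cu >> 8) & 0xFF)); // high byte
    }
    return bytes;
}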
0

Apologies up front... this will be an ugly reply with some long code. I ended up using the following function, effectively compiling iconv into my Windows application file by file :)

Hope this helps.

#include <cerrno>
#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>
#include <iconv.h>

// Convert `in` (in the current locale's encoding) to UTF-16LE using iconv.
// Returns a calloc'd buffer (caller frees) and writes the output byte count
// to *used_len, or returns 0 on failure.
char* conver(const char* in, size_t in_len, size_t* used_len)
{
    const int CC_MUL = 2; // 16 bit
    setlocale(LC_ALL, "");
    char* t1 = setlocale(LC_CTYPE, "");
    char* locn = (char*)calloc(strlen(t1) + 1, sizeof(char));
    if(locn == NULL)
    {
        return 0;
    }

    strcpy(locn, t1);
    const char* enc = strchr(locn, '.') + 1; // encoding part of "lang_REGION.ENCODING"

#if _WINDOWS
    std::string win = "WINDOWS-";
    win += enc;
    enc = win.c_str();
#endif

    iconv_t foo = iconv_open("UTF-16LE", enc);

    if(foo == (iconv_t)-1)
    {
        if(errno == EINVAL)
        {
            fprintf(stderr, "Conversion from %s is not supported\n", enc);
        }
        else
        {
            fprintf(stderr, "Initialization failure:\n");
        }
        free(locn);
        return 0;
    }

    size_t out_len = CC_MUL * in_len;
    size_t saved_in_len = in_len;
    iconv(foo, NULL, NULL, NULL, NULL); // reset the conversion state
    char* converted = (char*)calloc(out_len, sizeof(char));
    if(converted == NULL)
    {
        iconv_close(foo);
        free(locn);
        return 0;
    }
    char* converted_start = converted;
    char* t = const_cast<char*>(in);
    size_t ret = iconv(foo,
                       &t,
                       &in_len,
                       &converted,
                       &out_len);
    iconv_close(foo);
    *used_len = CC_MUL * saved_in_len - out_len;

    if(ret == (size_t)-1)
    {
        switch(errno)
        {
        case EILSEQ:
            fprintf(stderr, "EILSEQ\n");
            break;
        case EINVAL:
            fprintf(stderr, "EINVAL\n");
            break;
        }

        perror("iconv");
        free(converted_start);
        free(locn);
        return 0;
    }
    else
    {
        free(locn);
        return converted_start;
    }
}

Ferenc Deak
  • Link to [iconv](https://en.wikipedia.org/wiki/Iconv) plus the necessary includes would also improve this answer. – DevSolar Oct 08 '18 at 14:47
  • Thanks fritzone! I'd been banging my head for hours trying to get iconv() working until I gave up and came back to have another look :) Thanks very much, now the ntlm hashes are correct every time after the proper conversion. Who cares if it's not "great" code, it works! – JYG Oct 08 '18 at 18:15
  • @DevSolar this is just a function which I implemented in one of my really old, more experimental projects... which unfortunately was not very well commented, since it was one of those home-grown pet projects... so I sort of forgot the what and the why; I just know that it, well... works. – Ferenc Deak Oct 08 '18 at 18:50
  • Hi DevSolar, I just copied and pasted this in above main() and added inline to the function signature. To use this, #include <iconv.h> and call it like this: char pass[64]; strcpy(pass, "p4ssw0rd"); size_t used_bytes = 64*3; char *unicode_password = conver(pass, strlen(pass), &used_bytes); /* Now make an NTLM hash */ MD4_CTX ctx; MD4_Init(&ctx); MD4_Update(&ctx, unicode_password, used_bytes); MD4_Final(message_digest_somewhere, &ctx); Install libiconv and compile with g++ -o program program.cpp -lcrypto -liconv (I've added the lib for the OpenSSL functions there too). Also free(unicode_password). – JYG Oct 09 '18 at 02:31
  • OK, I looked a bit deeper into this and... just no. For one, it converts from whatever the current environment locale is (`setlocale( LC_ALL, "" )`), not UTF-8 explicitly. Returning UTF-16 in a `char *` is misleading at best -- there is `char16_t *` for UTF-16 and `unsigned char *` for a byte sequence. But the real clincher is that `CC_MUL` business -- if your conversion results in one or more UTF-16 surrogate pairs, your code simply breaks. Sorry, together with the whole comment / documentation issue, I can't help but downvote this. – DevSolar Oct 09 '18 at 07:25