
I need to get the UTF-8 representation of the following hex value, not UTF-16. I am using C++Builder 11.

setlocale(LC_ALL, ".UTF8");
String tb64 = UTF8String(U"D985"); // hex of the UTF-8 bytes for the letter م (M in Arabic)

// copy the (1-based) String characters into a std::wstring
std::wstring hex;
for (int i = 1; i <= tb64.Length(); ++i)
    hex += tb64[i];

int len = hex.length();
std::wstring newString;
std::wstring byte;
String S;

// take 4 hex digits at a time and treat the value as one wide character
for (int i = 0; i < len; i += 4)
{
    byte = hex.substr(i, 4);

    wchar_t chr = (wchar_t) wcstol(byte.c_str(), 0, 16);
    newString.push_back(chr);
    S = newString.c_str();
}

The output should be م (M in Arabic), not garbage, as this decoder demonstrates:

https://dencode.com/en/string?v=D985&oe=UTF-8&nl=crlf

user438383
  • You are doing something wrong and asking us to fix your incorrect solution. This is an example of the [XY problem](https://xyproblem.info/). Please explain: WHY? What are you trying to achieve? State the requirements, not the way you are trying to achieve them. There are many simpler ways to get this character. – Marek R Dec 07 '21 at 10:17
  • https://stackoverflow.com/questions/215963/how-do-you-properly-use-widechartomultibyte/215973 – VLL Dec 07 '21 at 10:20
  • There is `MultiByteToWideChar` function in Win32 API. With `CP_UTF8` for first parameter. – i486 Dec 07 '21 at 10:22
  • Output of what is wrong? This fragment produces no output. – n. m. could be an AI Dec 07 '21 at 10:40
  • unicode map says it's `0x0645`. Can't you just output `م`? Where do you get your input from? – Sergey Kolesnik Dec 07 '21 at 11:18
  • I am decoding a QR code encoded as a UTF-8 hex string – Ramez Abiad Dec 07 '21 at 17:44

1 Answer


You are assigning the hex string to a UTF8String, and then assigning that to a (Unicode)String, which will convert the UTF-8 to UTF-16. Then you are creating a separate std::wstring from the UTF-16 characters. std::wstring uses UTF-16 on Windows and UTF-32 on other platforms.

All of those string conversions are unnecessary, since you are dealing with hex characters in the ASCII range. So just iterate the characters of the original hex string as-is, no conversion needed.

In any case, you are trying to decode each 4-digit hex sequence directly into a binary Unicode codepoint number. But codepoint U+D985 falls in the UTF-16 surrogate range (U+D800..U+DFFF), so it is not a valid standalone Unicode character.
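
For illustration, here is a minimal sketch (plain standard C++, added for this write-up, not part of the original answer) of what the question's loop effectively computes, and why the result cannot render:

#include <cwchar>
#include <cstdio>

int main()
{
    // parse the 4 hex digits "D985" as one number, as the question's loop does
    long cp = std::wcstol(L"D985", nullptr, 16); // cp == 0xD985

    // 0xD800..0xDFFF is reserved for UTF-16 surrogates, so 0xD985 is not
    // a valid standalone character -- hence the garbage output
    std::printf("0x%lX is %sa surrogate\n", cp,
                (cp >= 0xD800 && cp <= 0xDFFF) ? "" : "not ");
    return 0;
}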

"D985" is actually the hex-encoded UTF-8 bytes of the Unicode character م (codepoint U+0645), so you need to convert each pair of 2 hex digits into a single byte, and store the bytes as-is into a UTF8String, not a std::wstring.

The RTL has a StrToInt() function that can decode a hex-encoded UnicodeString into an integer, which you can then treat as a byte in this case.

Try something more like this instead:

String hex = _D("D985");
int len = hex.Length();

UTF8String utf8;
for(int i = 1; i <= len; i += 2) {
    utf8 += static_cast<char>(StrToInt(_D("0x") + hex.Substring(i, 2)));
}

/* alternatively:
UTF8String utf8;
utf8.SetLength(len / 2);

for(int i = 1, j = 1; i <= len; i += 2, ++j) {
    utf8[j] = static_cast<char>(StrToInt(_D("0x") + hex.Substring(i, 2)));
}
*/

// use utf8 as needed...

If you need to convert the decoded UTF-8 to UTF-16, just assign the UTF8String as-is to a UnicodeString, eg:

UnicodeString utf16 = utf8;
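
The conversion also works in the other direction (an aside, not something stated in the original answer): assigning a UnicodeString back to a UTF8String re-encodes the UTF-16 data as UTF-8:

UTF8String roundTrip = utf16; // back to the raw UTF-8 bytes 0xD9 0x85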

Or, you can alternatively store the decoded bytes into a TBytes and then use the GetString() method of TEncoding::UTF8, eg:

String hex = _D("D985");
int len = hex.Length();

TBytes utf8;
utf8.Length = len / 2;
for(int i = 1, j = 0; i <= len; i += 2, ++j) {
    utf8[j] = static_cast<System::Byte>(StrToInt(_D("0x") + hex.Substring(i, 2)));
}

UnicodeString utf16 = TEncoding::UTF8->GetString(utf8);
// use utf16 as needed...
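
As a quick sanity check (an illustrative addition, assuming م really is the expected result), the decoded string should contain the single UTF-16 code unit 0x0645:

// UnicodeString indexing is 1-based
if (utf16.Length() == 1 && utf16[1] == 0x0645) {
    // decoded correctly: one character, U+0645 ARABIC LETTER MEEM
}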

I just thought of a slightly simpler solution - the RTL also has a HexToBin() function, which can decode an entire hex-encoded string into a full byte array in one operation, eg:

String hex = _D("D985");

UTF8String utf8;
utf8.SetLength(hex.Length() / 2);
HexToBin(hex.c_str(), &utf8[1], utf8.Length());

/* or:
TBytes utf8;
utf8.Length = hex.Length() / 2;
HexToBin(hex.c_str(), &utf8[0], utf8.Length);
*/

// use utf8 as needed...
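
For example (a hypothetical usage, not part of the answer above): since UTF8String stores the raw UTF-8 bytes, its buffer can be handed to any byte-oriented API:

// requires <cstdio>; assumes the console is set to UTF-8,
// e.g. via SetConsoleOutputCP(CP_UTF8) on Windows
std::fwrite(utf8.c_str(), 1, utf8.Length(), stdout);
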
Remy Lebeau
  • Note `D985` is a sequence of bytes which, interpreted as UTF-8, represents the desired character. So converting this hex value to an `int` is completely off track and will produce an invalid value. – Marek R Dec 07 '21 at 17:22
  • @MarekR that is incorrect. `"D985"` in this case is a **hex encoded string representation** of the raw UTF-8 bytes. The code I presented is converting the substring `"0xD9"` into a binary byte whose numeric value is `0xD9`, then the substring `"0x85"` into a binary byte whose numeric value is `0x85`, and so on. Thus retrieving the actual raw UTF-8 bytes from the hex string. – Remy Lebeau Dec 07 '21 at 17:30
  • Thank you, both solutions worked – Ramez Abiad Dec 07 '21 at 17:46
  • @RamezAbiad I just added another solution. – Remy Lebeau Dec 07 '21 at 17:59