You are assigning the hex string to a UTF8String, and then assigning that to a (Unicode)String, which will convert the UTF-8 to UTF-16. Then you are creating a separate std::wstring from the UTF-16 characters. std::wstring uses UTF-16 on Windows and UTF-32 on other platforms.
All of those string conversions are unnecessary, since you are dealing with hex characters in the ASCII range. So just iterate the characters of the original hex string as-is, no conversion needed.
In any case, you are trying to decode each 4-digit hex sequence directly into a binary Unicode codepoint number. But in this case, codepoint U+D985 is not a valid Unicode character: it falls in the surrogate range (U+D800..U+DFFF), which is reserved for UTF-16 surrogate pairs. "D985" is actually the hex-encoded UTF-8 bytes (0xD9 0x85) of the Unicode character م (codepoint U+0645), so you need to convert each pair of hex digits into a single byte, and store the bytes as-is into a UTF8String, not a std::wstring.
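To see why the bytes 0xD9 0x85 map to U+0645, here is a minimal sketch in portable standard C++ (not C++Builder-specific). The helper name decodeTwoByteUtf8 is hypothetical, and it only handles the two-byte UTF-8 form (110xxxxx 10yyyyyy):

```cpp
#include <cassert>
#include <cstdint>

// Decode a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) into its code point.
// Hypothetical helper for illustration; it only handles the two-byte form.
std::uint32_t decodeTwoByteUtf8(unsigned char lead, unsigned char trail) {
    assert((lead & 0xE0) == 0xC0);   // lead byte must look like 110xxxxx
    assert((trail & 0xC0) == 0x80);  // trail byte must look like 10yyyyyy
    // Keep the 5 payload bits of the lead byte and the 6 of the trail byte.
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (trail & 0x3Fu);
}
```

For 0xD9 0x85, the lead byte contributes 0x19 and the trail byte contributes 0x05, so the result is (0x19 << 6) | 0x05 = 0x0645.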
The RTL has a StrToInt() function that can decode a hex-encoded UnicodeString (when it is prefixed with "0x") into an integer, which you can then treat as a byte in this case.
Try something more like this instead:
String hex = _D("D985");
int len = hex.Length();
UTF8String utf8;
// UnicodeString is 1-indexed; decode 2 hex digits per byte
for(int i = 1; i <= len; i += 2) {
    utf8 += static_cast<char>(StrToInt(_D("0x") + hex.SubString(i, 2)));
}
/* alternatively:
UTF8String utf8;
utf8.SetLength(len / 2);
for(int i = 1, j = 1; i <= len; i += 2, ++j) {
    utf8[j] = static_cast<char>(StrToInt(_D("0x") + hex.SubString(i, 2)));
}
*/
// use utf8 as needed...
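For comparison, the same pair-by-pair decoding can be sketched in portable standard C++, without the RTL. This is only an illustration under my own assumptions: the names hexDigitValue and decodeHex are hypothetical, and I chose to throw on malformed input, which the code above does not do explicitly:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert one hex digit to its numeric value, or throw if it is not one.
int hexDigitValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("not a hex digit");
}

// Decode a hex string such as "D985" into raw bytes (0xD9 0x85).
std::string decodeHex(const std::string& hex) {
    if (hex.size() % 2 != 0)
        throw std::invalid_argument("odd number of hex digits");
    std::string bytes;
    bytes.reserve(hex.size() / 2);
    for (std::size_t i = 0; i < hex.size(); i += 2) {
        // Two hex digits form one byte: high nibble first, then low nibble.
        int value = hexDigitValue(hex[i]) * 16 + hexDigitValue(hex[i + 1]);
        bytes.push_back(static_cast<char>(value));
    }
    assert(bytes.size() == hex.size() / 2);  // sanity check on the output size
    return bytes;
}
```

Calling decodeHex("D985") yields a 2-byte string holding 0xD9 0x85, which is the UTF-8 encoding of م.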
If you need to convert the decoded UTF-8 to UTF-16, just assign the UTF8String as-is to a UnicodeString, eg:
UnicodeString utf16 = utf8;
Or, you can alternatively store the decoded bytes into a TBytes and then use the GetString() method of TEncoding::UTF8, eg:
String hex = _D("D985");
int len = hex.Length();
TBytes utf8;
utf8.Length = len / 2;
// i walks the 1-indexed string, j walks the 0-indexed byte array
for(int i = 1, j = 0; i <= len; i += 2, ++j) {
    utf8[j] = static_cast<System::Byte>(StrToInt(_D("0x") + hex.SubString(i, 2)));
}
UnicodeString utf16 = TEncoding::UTF8->GetString(utf8);
// use utf16 as needed...
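If you are curious what a UTF-8 to UTF-16 conversion involves conceptually, here is a rough sketch in portable standard C++. It is not how the RTL implements it. The function name utf8ToUtf16 is hypothetical, and the code assumes well-formed input (it does not reject overlong or otherwise invalid sequences):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Convert well-formed UTF-8 to UTF-16. Illustrative only: assumes valid input.
std::u16string utf8ToUtf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp;   // code point being assembled
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        assert(i + extra < utf8.size());  // sequence must not be truncated
        for (std::size_t k = 1; k <= extra; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3Fu);
        i += extra + 1;
        if (cp < 0x10000) {
            // Code points in the BMP fit in a single UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Everything else is split into a high/low surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For the bytes 0xD9 0x85, this produces the single UTF-16 code unit 0x0645. Surrogate values such as 0xD985 only ever appear in the output as one half of a pair, which is why decoding "D985" directly as a codepoint does not give a usable character.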
I just thought of a slightly simpler solution - the RTL also has a HexToBin() function, which can decode an entire hex-encoded string into a full byte array in one operation, eg:
String hex = _D("D985");
UTF8String utf8;
utf8.SetLength(hex.Length() / 2);
HexToBin(hex.c_str(), &utf8[1], utf8.Length());
/* or:
TBytes utf8;
utf8.Length = hex.Length() / 2;
HexToBin(hex.c_str(), &utf8[0], utf8.Length);
*/
// use utf8 as needed...