std::string
has no encoding of its own. It can easily hold characters encoded in ASCII, UTF-8, ISO-8859-x, Windows-125x, etc. They are just raw bytes, as far as std::string
is concerned. So, before you can print your output in ISO-8859-1 specifically, you need to first know what the std::string
is already holding so it can be converted to ISO-8859-1 if needed.
FYI, ffffffc3ffffff89
is simply the two char
values 0xc3 0x89
(the UTF-8 encoded form of É
) being sign-extended to 32 bits. Which means your compiler implements char
as a signed type rather than an unsigned
type. To eliminate the leading f
s, you need to cast each char
to unsigned char
before then casting to unsigned
. You also will need to account for unsigned
values < 10 so that the output is an even multiple of 2 hex digits per char
, eg:
strstream << std::hex << std::setw(2) << std::setfill('0') << static_cast<unsigned>(static_cast<unsigned char>(input[i]));
So, it appears that your std::string
is encoded in UTF-8. There are plenty of libraries available that can convert text from one encoding to another, such as ICU or ICONV. Or platform-specific APIs, like WideCharToMultiByte()
/MultiByteToWideChar()
on Windows, std::mbstowcs()
/std::wcstombs()
, etc (provided suitable locales are installed in the OS). But there is nothing really built-in to C++ for this exact UTF-8 to ISO-8859-1 conversion. Though, you could use the (deprecated) std::wstring_convert
to decode the UTF-8 std::string
to a UTF-16/32 encoded std::wstring
, or a UTF-16 encoded std::u16string
, at least. And then you can convert that to ISO-8859-1 using whatever library you want as needed.
Or, knowing that the input is UTF-8 and the output is ISO-8859-1, it is really not that hard to just convert the data manually, decoding the UTF-8 into codepoints, and then encoding those codepoints to bytes. Both encodings are well-documented and fairly easy to write code for without too much effort, eg:
size_t nextUtf8CodepointLen(const char* data)
{
unsigned char ch = static_cast<unsigned char>(*data);
if ((ch & 0x80) == 0) {
return 1;
}
if ((ch & 0xE0) == 0xC0) {
return 2;
}
if ((ch & 0xF0) == 0xE0) {
return 3;
}
if ((ch & 0xF8) == 0xF0) {
return 4;
}
return 0;
}
unsigned nextUtf8Codepoint(const char* &data, size_t &data_size)
{
if (data_size == 0) return -1;
unsigned char ch = static_cast<unsigned char>(*data);
size_t len = nextUtf8CodepointLen(data);
++data;
--data_size;
if (len < 2) {
return (len == 1) ? static_cast<unsigned>(ch) : 0xFFFD;
}
--len;
unsigned cp;
if (len == 1) {
cp = ch & 0x1F;
}
else if (len == 2) {
cp = ch & 0x0F;
}
else {
cp = ch & 0x07;
}
if (len > data_size) {
data += data_size;
data_size = 0;
return 0xFFFD;
}
for(size_t j = 0; j < len; ++j) {
ch = static_cast<unsigned char>(data[j]);
if ((ch & 0xC0) != 0x80) {
cp = 0xFFFD;
break;
}
cp = (cp << 6) | (ch & 0x3F);
}
data += len;
data_size -= len;
return cp;
}
std::string Helper::ToHex(const std::string &input) {
const char *data = input.c_str();
size_t data_size = input.size();
std::ostringstream oss;
unsigned cp;
while ((cp = nextUtf8Codepoint(data, data_size)) != -1) {
if (cp > 0xFF) {
cp = static_cast<unsigned>('?');
}
oss << std::hex << std::setw(2) << std::setfill('0') << cp;
}
return oss.str();
}
Online Demo