
Referring to the ISO-8859-1 (Latin-1) code table, the capital E acute (É) has a hex value of C9.

I am trying to write a function that takes a std::string and then converts it to hex according to the ISO-8859-1 encoding above.

Currently, I am only able to write a function that converts an ASCII string to hex:

std::string Helper::ToHex(std::string input) {
    std::stringstream strstream;
    std::string output;
    for (std::size_t i = 0; i < input.length(); i++) {
        strstream << std::hex << unsigned(input[i]);
    }
    strstream >> output;
    return output;
}

However, this function can't do the job when the input has accented characters. It will convert É to a hex value of ffffffc3ffffff89.

  • Please explain "this function can't do the job". What does it do? – 273K Aug 19 '21 at 15:45
  • @S.M. updated in the description, thanks – zfgo Aug 19 '21 at 15:53
  • You have to cast it to `unsigned char`, not `unsigned`. (which would get you c389 because your string is not in iso-8859-1) –  Aug 19 '21 at 15:57
  • Your string is encoded as UTF-8. not ISO 8859-1. If you want ISO 8859-1 encoding, you need to re-encode it. – n. m. could be an AI Aug 19 '21 at 15:57
  • @dratenik, how do I encode my string to iso-8859-1? – zfgo Aug 19 '21 at 16:00
  • There are libraries like iconv or ICU which you can use for such a conversion. –  Aug 19 '21 at 16:01
  • @dratenik since ISO-8859-1 is used for the first 256 code points of Unicode, there's no need for a full blown library. A simple function that decodes UTF-8 to Unicode will do. See e.g. [UTF8 to/from wide char conversion in STL](https://stackoverflow.com/q/148403/5987). – Mark Ransom Aug 19 '21 at 16:06
  • @dratenik "*You have to cast it to `unsigned char`, not `unsigned`*" - actually, you need both. `unsigned char` to avoid the sign extension issue, and `unsigned` so `std::hex` will take effect (because `operator<<` treats an `unsigned char` as text, not as an integer). – Remy Lebeau Aug 19 '21 at 16:29
  • @MarkRansom "*since ISO-8859-1 is used for the first 256 code points of Unicode*" - technically, ISO-8859-1 only covers 191 of the 1st 256 Unicode codepoints, there are gaps in the coverage. But of those 191, the bytes and codepoints are 1:1 in value, yes – Remy Lebeau Aug 19 '21 at 17:23

1 Answer


std::string has no encoding of its own. It can just as easily hold characters encoded in ASCII, UTF-8, ISO-8859-x, Windows-125x, etc. They are just raw bytes as far as std::string is concerned. So, before you can print your output in ISO-8859-1 specifically, you first need to know what encoding the std::string is already holding so it can be converted to ISO-8859-1 if needed.

FYI, ffffffc3ffffff89 is simply the two char values 0xC3 0x89 (the UTF-8 encoded form of É) being sign-extended to 32 bits, which means your compiler implements char as a signed type rather than an unsigned type. To eliminate the leading fs, you need to cast each char to unsigned char before casting to unsigned. You will also need to pad values below 0x10 so the output has exactly 2 hex digits per char, eg:

strstream << std::hex << std::setw(2) << std::setfill('0') << static_cast<unsigned>(static_cast<unsigned char>(input[i]));
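
Putting those pieces together, a minimal sketch of a corrected byte-dumper (written here as a free function for illustration; it assumes <iomanip>, <sstream> and <string> are included) would be:

std::string toHexBytes(const std::string &input) {
    std::ostringstream oss;
    for (char c : input) {
        // cast to unsigned char first to avoid sign extension,
        // then to unsigned so operator<< prints a number instead of a character
        oss << std::hex << std::setw(2) << std::setfill('0')
            << static_cast<unsigned>(static_cast<unsigned char>(c));
    }
    return oss.str();
}

Note that this still dumps the raw bytes of whatever encoding the string happens to be in, so a UTF-8 É comes out as c389, not c9.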

So, it appears that your std::string is encoded in UTF-8. There are plenty of libraries available that can convert text from one encoding to another, such as ICU or ICONV. Or platform-specific APIs, like WideCharToMultiByte()/MultiByteToWideChar() on Windows, std::mbstowcs()/std::wcstombs(), etc (provided suitable locales are installed in the OS). But there is nothing really built-in to C++ for this exact UTF-8 to ISO-8859-1 conversion. Though, you could use the (deprecated) std::wstring_convert to decode the UTF-8 std::string to a UTF-16/32 encoded std::wstring, or a UTF-16 encoded std::u16string, at least. And then you can convert that to ISO-8859-1 using whatever library you want as needed.
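
For instance, a rough sketch of the std::wstring_convert route (deprecated since C++17 but still usable; the function name utf8ToLatin1 is just for illustration, and <codecvt>, <locale> and <string> are assumed) might look like this:

std::string utf8ToLatin1(const std::string &utf8) {
    // decode the UTF-8 bytes into a string of 32-bit codepoints
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string codepoints = conv.from_bytes(utf8);

    std::string latin1;
    for (char32_t cp : codepoints) {
        // codepoints U+0000..U+00FF map 1:1 onto ISO-8859-1 bytes;
        // anything higher has no Latin-1 representation
        latin1 += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
    }
    return latin1;
}

Be aware that from_bytes() throws std::range_error on malformed UTF-8, so you may want to wrap the call accordingly.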

Or, knowing that the input is UTF-8 and the output is ISO-8859-1, it is really not that hard to just convert the data manually, decoding the UTF-8 into codepoints and then encoding those codepoints to bytes. Both encodings are well-documented and fairly easy to handle by hand, eg:

size_t nextUtf8CodepointLen(const char* data)
{
    unsigned char ch = static_cast<unsigned char>(*data);

    // 0xxxxxxx : single-byte (ASCII) codepoint
    if ((ch & 0x80) == 0) {
        return 1;
    }

    // 110xxxxx : lead byte of a 2-byte sequence
    if ((ch & 0xE0) == 0xC0) {
        return 2;
    }

    // 1110xxxx : lead byte of a 3-byte sequence
    if ((ch & 0xF0) == 0xE0) {
        return 3;
    }

    // 11110xxx : lead byte of a 4-byte sequence
    if ((ch & 0xF8) == 0xF0) {
        return 4;
    }

    // anything else is not a valid UTF-8 lead byte
    return 0;
}

unsigned nextUtf8Codepoint(const char* &data, size_t &data_size)
{
    if (data_size == 0) return -1; // end of input (wraps to the max unsigned value)

    unsigned char ch = static_cast<unsigned char>(*data);
    size_t len = nextUtf8CodepointLen(data);

    ++data;
    --data_size;

    if (len < 2) {
        // single-byte codepoint, or an invalid lead byte -> U+FFFD replacement char
        return (len == 1) ? static_cast<unsigned>(ch) : 0xFFFD;
    }

    --len; // len is now the number of continuation bytes

    unsigned cp;

    // mask off the length marker bits of the lead byte
    if (len == 1) {
        cp = ch & 0x1F;
    }
    else if (len == 2) {
        cp = ch & 0x0F;
    }
    else {
        cp = ch & 0x07;
    }

    if (len > data_size) {
        // truncated sequence: consume what is left and emit the replacement char
        data += data_size;
        data_size = 0;
        return 0xFFFD;
    }

    for(size_t j = 0; j < len; ++j) {
        ch = static_cast<unsigned char>(data[j]);

        if ((ch & 0xC0) != 0x80) {
            // not a valid continuation byte (10xxxxxx)
            cp = 0xFFFD;
            break;
        }

        cp = (cp << 6) | (ch & 0x3F);
    }

    data += len;
    data_size -= len;

    return cp;
}

std::string Helper::ToHex(const std::string &input) {
    const char *data = input.c_str();
    size_t data_size = input.size();

    std::ostringstream oss;
    unsigned cp;

    while ((cp = nextUtf8Codepoint(data, data_size)) != -1) {
        if (cp > 0xFF) {
            // codepoint has no ISO-8859-1 equivalent, substitute '?'
            cp = static_cast<unsigned>('?');
        }
        oss << std::hex << std::setw(2) << std::setfill('0') << cp;
    }

    return oss.str();
}
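
As a quick sanity check (assuming the input bytes really are UTF-8), the two bytes 0xC3 0x89 decode to U+00C9 and come out as the single Latin-1 byte c9:

std::string hex = Helper::ToHex("\xC3\x89"); // UTF-8 bytes for É
// hex == "c9"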

Online Demo
