
Situation

I need a function that takes a string and replaces every non-ASCII character with the hexadecimal number of its UTF-8 encoding.

For example, ӷ in a word like "djvӷdio" should be substituted with "d3b7" while the rest remains untouched.

Explanation:
ӷ is encoded in UTF-8 as the bytes 0xD3 0xB7; read as one integer that is 54199, i.e. d3b7 in hexadecimal
djvӷdio --> djvd3b7dio

I already have a function that returns the hex value of an int.

My Machine

  • Kubuntu 19.10
  • Compiler: g++ (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008

My Ideas

Idea 1

std::string encode_utf8(const std::string &str);

Using the function above, I iterate through the whole string, which contains Unicode, and whenever the current char is non-ASCII I replace it with its hex value.

Problem:

Iterating through a UTF-8 string byte by byte is not workable: a Unicode character occupies up to four bytes, unlike a plain char, so a single character may be treated as several chars, which produces garbage. In short, the string cannot simply be indexed.

Idea 2

std::string encode_utf8(const std::wstring &wstr);

Again, I iterate through the whole string of Unicode chars, and if the current char is non-ASCII I replace it with its hex value.

Problem:

Indexing now works, but it yields a wchar_t holding the UTF-32 code point, whereas I definitely need the UTF-8 bytes.


How can I get a character out of a string in a way that lets me obtain its UTF-8 number?

Spixmaster
  • If my understanding of UTF-8 is correct, a character encoding can be anywhere between 1 and 4 bytes long. Also, in your `djvd3b7dio` example, what if the original string really did contain the literal `d3b7` as a sub-string, how would your decoder detect the difference? – 500 - Internal Server Error Dec 10 '19 at 15:28
  • Please do not invent new encodings. Either use regular UTF-8, or if you must keep it 7-bit safe: quoted-printable or HTML entities. – Botje Dec 10 '19 at 15:30
  • @500-InternalServerError Your understanding of utf-8 seems to be correct. I do not need to decode, just encode. – Spixmaster Dec 10 '19 at 15:30
  • @Botje I do not get your point. I do not reinvent encoding. I simply want to get the utf-8 int number of a char from a string. – Spixmaster Dec 10 '19 at 15:32
  • @Spixmaster "encodes all non-ascii chars to utf-8 as hexadecimal number and substitutes it with that." <- this is an encoding. Regular UTF-8 encodes a Unicode codepoint to bytes, you choose to encode it to hex characters. Ergo, this is a new encoding. Please use a standard one. Your future users and/or self will thank you. – Botje Dec 10 '19 at 15:34
  • Why would you want to destroy information that is present in your string? – n. m. could be an AI Dec 10 '19 at 17:23
  • @n.'pronouns'm. There is no information destroyed. It is simply substituted with the hex value of the proper UTF-8 int. – Spixmaster Dec 10 '19 at 17:25
  • Of course it is destroyed. After the substitution there is no way of knowing what was there, ӷ or d3b7. – n. m. could be an AI Dec 10 '19 at 17:31
  • @n.'pronouns'm. "d3b7" is the hex value of 54199, which represents the letter ӷ, so no information is lost. – Spixmaster Dec 10 '19 at 17:36
  • Let's try again. I give you a string that contains four ASCII characters d,e,a,d. It could represent an English word "dead" or a Unicode character U+07AD, which is 0xde 0xad in UTF-8. What does it represent? – n. m. could be an AI Dec 10 '19 at 19:41
  • @n.'pronouns'm. I know what you mean. But if I distinguish hex from ASCII by putting \x before it, it is clear. One string is "dead" and the other "\xde\xad". – Spixmaster Dec 10 '19 at 21:13

2 Answers


Your input string is UTF8-encoded, which means each character is encoded by anything from one to four bytes. You cannot just scan through the string and convert them, unless your loop has an understanding of how Unicode characters are encoded in UTF8.

You need a UTF8 decoder.

Fortunately there are really lightweight ones you can use, if all you need is decoding. UTF8-CPP is pretty much one header, and has a feature to provide you with individual Unicode characters. utf8::next will feed you uint32_t (the "largest" character's codepoint fits into an object of this type). Now you can simply see whether the value is less than 128: if it is, cast to char and append; if it isn't, serialise the integer in whatever way you see fit.

I implore you to consider whether this is really what you want to do, though. Your output will be ambiguous. It'll be impossible to determine whether a bunch of numbers in it are actual numbers, or the representation of some non-ASCII character. Why not just stick with the original UTF8 encoding, or use something like HTML entity encoding or quoted-printable? These encodings are widely understood and widely supported.

Lightness Races in Orbit
  • I understand your concern as I left out some points for the sake of shortness. What I need is Python's encode function in C++. `string = "pythön!"` `string = string.encode()` `pythön! --> pyth\xc3\xb6n!` – Spixmaster Dec 10 '19 at 15:39
  • But that is just the external representation of the bytes `c3 b6`! What you asked for were literal hex characters, so the four-byte string `63 33 62 36`. – Botje Dec 10 '19 at 15:41
  • @Spixmaster That's not what you said at all. – Lightness Races in Orbit Dec 10 '19 at 16:06
  • @LightnessRaceswithMonica I do not understand why that is not what I said. I am sorry for the understanding problems. Is the method you suggested still correct? – Spixmaster Dec 10 '19 at 16:15
  • @Spixmaster I'm no longer sure what it really is that you're asking for. Please give exact, specific, unambiguous requirements. – Lightness Races in Orbit Dec 10 '19 at 16:16
  • @LightnessRaceswithMonica I give this as an input for a function: `pythön!` The function shall return: `pyth\xc3\xb6n!`. Am I expressing myself precisely enough? – Spixmaster Dec 10 '19 at 16:18
  • @Spixmaster No, you're not. Is that the literal, ASCII output that you want? (That's not what `string.encode()` does). Or is a string-literal representation of what you want? (That's already what your input is!!) Tell us exactly _what bytes you want_. – Lightness Races in Orbit Dec 10 '19 at 16:22
  • @LightnessRaceswithMonica Yes, `pyth\xc3\xb6n!` is the ascii output that I want. – Spixmaster Dec 10 '19 at 16:25
  • @Spixmaster Then the Unicode character U+00C3 is given to you as an integer with value 0xC3 (195), and you just make your serialiser output the ASCII "\" "x" "c" "3", for that integer, right? It didn't seem to me that you were asking how to do the serialisation part - for that you could ask a new question, though it's a pretty well-covered topic both on SO and more widely on the web. – Lightness Races in Orbit Dec 10 '19 at 16:55
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/203992/discussion-between-spixmaster-and-lightness-races-with-monica). – Spixmaster Dec 10 '19 at 16:58
  • @Spixmaster I have nothing to add at this time. – Lightness Races in Orbit Dec 10 '19 at 17:02
  • Is `\xc3` 1 byte or 4 bytes? – Mark Ransom Dec 10 '19 at 17:04
  • @MarkRansom I _think_ they're saying it's four. My attempts to gain further unambiguous clarification on the matter have unfortunately failed. – Lightness Races in Orbit Dec 10 '19 at 17:04
  • Sorry, I forgot to address my question to @Spixmaster. There's no such thing as a bad question, but unfortunately there's no shortage of poorly expressed ones. – Mark Ransom Dec 10 '19 at 17:10
  • @MarkRansom I am sorry for the poor expression. I tried my best. I hope that I found a solution with that I can continue my work: https://stackoverflow.com/questions/42012563/convert-unicode-code-points-to-utf-8-and-utf-32 – Spixmaster Dec 10 '19 at 17:14

I just solved the issue:

std::string Tools::encode_utf8(const std::wstring &wstr)
{
    std::string utf8_encoded;

    //iterate through the whole string
    for(size_t j = 0; j < wstr.size(); ++j)
    {
        if(wstr.at(j) <= 0x7F)
            utf8_encoded += wstr.at(j);
        else if(wstr.at(j) <= 0x7FF)
        {
            //our template for unicode of 2 bytes
            int utf8 = 0b11000000'10000000;

            //get the first 6 bits and save them
            utf8 += wstr.at(j) & 0b00111111;

            /*
             * get the last 5 remaining bits
             * put them 2 to the left so that the 10 from 10xxxxxx (first byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00000111'11000000) << 2;

            //append to the result
            std::string temp = Tools::to_hex(utf8);
            utf8_encoded.append(temp.insert(0, "\\x").insert(4, "\\x"));
        }
        else if(wstr.at(j) <= 0xFFFF)
        {
            //our template for unicode of 3 bytes
            int utf8 = 0b11100000'10000000'10000000;

            //get the first 6 bits and save them
            utf8 += wstr.at(j) & 0b00111111;

            /*
             * get the next 6 bits
             * put them 2 to the left so that the 10 from 10xxxxxx (first byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00001111'11000000) << 2;

            /*
             * get the last 4 remaining bits
             * put them 4 to the left so that the 10xx from 10xxxxxx (second byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b11110000'00000000) << 4;

            //append to the result
            std::string temp = Tools::to_hex(utf8);
            utf8_encoded.append(temp.insert(0, "\\x").insert(4, "\\x").insert(8, "\\x"));
        }
        else if(wstr.at(j) <= 0x10FFFF)
        {
            //our template for unicode of 4 bytes; uint32_t (<cstdint>) because the value does not fit a signed int
            uint32_t utf8 = 0b11110000'10000000'10000000'10000000;

            //get the first 6 bits and save them
            utf8 += wstr.at(j) & 0b00111111;

            /*
             * get the next 6 bits
             * put them 2 to the left so that the 10 from 10xxxxxx (first byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00001111'11000000) << 2;

            /*
             * get the next 6 bits
             * put them 4 to the left so that the 10xx from 10xxxxxx (second byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00000011'11110000'00000000) << 4;

            /*
             * get the last 3 remaining bits
             * put them 6 to the left so that the 10xxxx from 10xxxxxx (third byte) is not overwritten
             */
            utf8 += (wstr.at(j) & 0b00011100'00000000'00000000) << 6;

            //append to the result
            std::string temp = Tools::to_hex(utf8);
            utf8_encoded.append(temp.insert(0, "\\x").insert(4, "\\x").insert(8, "\\x").insert(12, "\\x"));
        }
    }
    return utf8_encoded;
}
Spixmaster