Convert from UTF-8 to unicode c++

Question

How do I convert ú within a C++ application? The application receives the character URL-encoded as UTF-8 (%C3%BA) and needs to store it as the Unicode equivalent, %FA. I just want to know how to go about writing code to perform this conversion.

http://msdn.microsoft.com/en-us/library/dd374130(v=vs.85).aspx ? — Zac Howland, Aug 30 '13 at 13:52
Just for the record, with regards to your title: UTF-8 _is_ Unicode. And the standard way of specifying the code point would be `U+00FA` (with at least 4 hex digits, but up to 6). — James Kanze, Aug 30 '13 at 13:58
You look up the rules for UTF-8, Unicode, URL encoding, etc., and you implement them in code. I don't know any other way to answer the question. It might help you progress if you said specifically where you are stuck. I would break the problem into three steps: URL-decode (convert %xy etc. to character values), UTF-8 to Unicode code point (this converts, for instance, C3 BA to FA; it is the difficult step), and URL-encode (put back the %'s). Each of these steps is simpler than the overall problem; just pick the easiest and code that one first. — john, Aug 30 '13 at 14:07
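
The three steps outlined in the last comment can be sketched roughly as follows. This is a minimal illustration rather than code from the thread: the function names (`url_decode`, `utf8_decode`, `url_encode_latin1`, `utf8_url_to_latin1_url`) are invented for this example, and it assumes the target is the single-byte `%XX` form from the question, so code points above U+00FF are rejected. It does not check for overlong encodings or surrogates.

```cpp
#include <cctype>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: value of one hexadecimal digit.
int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("not a hex digit");
}

// Step 1: URL-decode, e.g. "%C3%BA" becomes the two raw bytes C3 BA.
std::string url_decode(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] != '%') { out += in[i]; continue; }
        if (i + 2 >= in.size()) throw std::invalid_argument("truncated escape");
        out += static_cast<char>(hex_digit(in[i + 1]) * 16 + hex_digit(in[i + 2]));
        i += 2;
    }
    return out;
}

// Step 2: UTF-8 bytes to code points (sequences of 1 to 4 bytes).
std::vector<char32_t> utf8_decode(const std::string& bytes)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < bytes.size();)
    {
        unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        int extra = 0;      // number of continuation bytes expected
        char32_t cp = 0;    // payload bits collected so far
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        for (; extra > 0; --extra)
        {
            if (i >= bytes.size()) throw std::invalid_argument("truncated sequence");
            unsigned char c = static_cast<unsigned char>(bytes[i++]);
            if ((c & 0xC0) != 0x80) throw std::invalid_argument("bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Step 3: URL-encode again, one byte per code point (assumption: values <= 0xFF).
std::string url_encode_latin1(const std::vector<char32_t>& cps)
{
    std::string out;
    for (char32_t cp : cps)
    {
        if (cp > 0xFF) throw std::out_of_range("code point does not fit in one byte");
        if (cp < 0x80 && std::isalnum(static_cast<int>(cp)))
        {
            out += static_cast<char>(cp);   // ASCII letters and digits pass through
        }
        else
        {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}

// All three steps chained together.
std::string utf8_url_to_latin1_url(const std::string& s)
{
    return url_encode_latin1(utf8_decode(url_decode(s)));
}
```

Calling `utf8_url_to_latin1_url("%C3%BA")` should give `%FA`. Each step can be tested on its own, which fits the advice to code the easiest one first.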

score 8 · Accepted Answer · answered Aug 30 '13 at 14:00

I just wrote some code to do this yesterday...

I'm not saying this is the "perfect" way to do this, but it appears to work for all testcases I've run through it (I wrote both directions for that purpose).

I'll leave it to you to translate "%NN" to an integer value.

#include <iostream>
#include <deque>

std::deque<int> unicode_to_utf8(int charcode)
{
    std::deque<int> d;
    if (charcode < 128)
    {
        d.push_back(charcode);
    }
    else
    {
        int first_bits = 6; 
        const int other_bits = 6;
        int first_val = 0xC0;
        int t = 0;
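        // Peel off 6 payload bits at a time into continuation bytes (10xxxxxx)
        // until what remains fits in the lead byte.
        while (charcode >= (1 << first_bits))
        {
            t = 128 | (charcode & ((1 << other_bits)-1));
            charcode >>= other_bits;
            // Mark the next prefix bit of the lead byte (a no-op on the first
            // pass, since 0xC0 already has bit 6 set) and reduce its payload
            // width by one bit.
            first_val |= 1 << (first_bits);
            first_bits--;
            d.push_front(t);
        }
        // Whatever is left goes into the lead byte, below its prefix bits.
        t = first_val | charcode;
        d.push_front(t);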
    }
    return d;
}


int utf8_to_unicode(std::deque<int> &coded)
{
    int charcode = 0;
    int t = coded.front();
    coded.pop_front();
    if (t < 128)
    {
        return t;
    }
    int high_bit_mask = (1 << 6) -1;
    int high_bit_shift = 0;
    int total_bits = 0;
    const int other_bits = 6;
    while((t & 0xC0) == 0xC0)
    {
        t <<= 1;
        t &= 0xff;
        total_bits += 6;
        high_bit_mask >>= 1; 
        high_bit_shift++;
        charcode <<= other_bits;
        charcode |= coded.front() & ((1 << other_bits)-1);
        coded.pop_front();
    } 
    charcode |= ((t >> high_bit_shift) & high_bit_mask) << total_bits;
    return charcode;
}

int main()
{
    int charcode; 

    for(;;)
    {
        std::cout << "Enter unicode value:" << std::endl;
        if (!(std::cin >> charcode))
        {
            break; // stop on end of input or a non-numeric value
        }
        auto x = unicode_to_utf8(charcode);
        for(auto c : x)
        {
            std::cout << "\\x" << std::hex << c << " ";
        }
        std::cout << std::endl;
        int c = utf8_to_unicode(x);
        std::cout << "reversed:" << std::dec << c << std::hex << " in hex:" << c << std::endl;
    }
}

The code contains BOTH directions - from a deque to unicode and from unicode to deque. It just doesn't happen to have the "required" code FIRST, I wasn't going to reformat my code... — Mats Petersson, Aug 30 '13 at 14:08
Just a little note regarding naming; I suggest the names `utf32_to_utf8` and `utf8_to_utf32`; the word "unicode" is a bit overloaded and is sometimes understood to mean utf-16. — avakar, Aug 30 '13 at 14:38
Yes, the name isn't great; the REAL code that I use this in (in PHP, the above was just a hack to test the principle) is called `utf8_to_html`, and produces a `"&#x1234;"` string. — Mats Petersson, Aug 30 '13 at 15:50
@MatsPetersson Thanks for the code above. I'm struggling to implement this in my code as I'm new to C++. How would the string %C3%BA be converted using this code? — user2724841, Sep 04 '13 at 11:13
You will have to split at the `%` signs, and then convert from hex to a `deque`. The basic principle is that the first byte of a multi-byte UTF-8 sequence has at least its two highest bits set (hence the `(t & 0xc0) == 0xc0`), followed by a zero bit and then the "payload" bits (3 to 5 of them in standard UTF-8, which is limited to four-byte sequences). The remaining bytes have `10` in the highest two bits, and then 6 bits of "payload" in the lower bits. In your case it's a two-byte encoding, so the first byte contains the upper 5 bits and the second byte the lower 6. BA is 10111010 in binary, which gives 111010 (the lower bits); C3 is 11000011, which adds 00011, giving 00011111010 = 0xFA. — Mats Petersson, Sep 04 '13 at 12:19
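
As a rough sketch of what that last comment describes, the snippet below splits a string such as `%C3%BA` at the `%` signs into a `std::deque<int>` (the input type `utf8_to_unicode` expects) and spells out the two-byte bit arithmetic. The names `percent_to_deque` and `decode_two_bytes` are made up for this illustration, and the parser assumes the input consists only of well-formed `%XX` escapes.

```cpp
#include <deque>
#include <sstream>
#include <string>

// Split "%C3%BA" at the '%' signs and parse each piece as a hex byte.
// Assumption: the input holds nothing but %XX escapes.
std::deque<int> percent_to_deque(const std::string& s)
{
    std::deque<int> bytes;
    std::istringstream in(s);
    std::string piece;
    while (std::getline(in, piece, '%'))
    {
        if (piece.empty())
        {
            continue; // the empty piece in front of the first '%'
        }
        bytes.push_back(std::stoi(piece, nullptr, 16));
    }
    return bytes;
}

// Two-byte case only: 110xxxxx 10yyyyyy -> xxxxxyyyyyy.
// The lead byte contributes its low 5 bits, the continuation byte its low 6.
int decode_two_bytes(int lead, int cont)
{
    return ((lead & 0x1F) << 6) | (cont & 0x3F);
}
```

With these, `percent_to_deque("%C3%BA")` holds 0xC3 and 0xBA, and `decode_two_bytes(0xC3, 0xBA)` is 0xFA, matching the arithmetic in the comment.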

score 1 · Answer 2 · answered Jan 14 '20 at 09:37

This is actually in the standard library:

#include <string>
#include <codecvt> // for std::codecvt_utf8
#include <locale>  // for std::wstring_convert


std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv_utf8_utf32;


int main() {

    std::string utf8_bytes = "ú";
    std::u32string unicode_codepoints = conv_utf8_utf32.from_bytes(utf8_bytes);

    return 0;
}

The other way around is done with conv_utf8_utf32.to_bytes.
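
A minimal sketch of that reverse direction, wrapped in a hypothetical helper named `utf32_to_utf8` (the name is not part of the standard library). Note that `std::wstring_convert` and `<codecvt>` are deprecated since C++17, so a compiler may emit deprecation warnings here.

```cpp
#include <codecvt> // for std::codecvt_utf8
#include <locale>  // for std::wstring_convert
#include <string>

// Encode a sequence of code points as UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& code_points)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(code_points);
}
```

For example, `utf32_to_utf8(U"\u00FA")` should yield the two bytes C3 BA.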

Example with printing in your %hex format using printf:

#include <string>
#include <codecvt> // for std::codecvt_utf8
#include <locale>  // for std::wstring_convert
#include <cstdio>


std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv_utf8_utf32;


int main() {

    std::string utf8_bytes = "ú";
    // print the bytes in %hex format
    for (char byte: utf8_bytes) {
        // %02X pads with a leading zero, so bytes below 0x10 still print two digits
        printf("%%%02X", static_cast<unsigned>(static_cast<unsigned char>(byte)));
    }   
    printf("\n");


    std::u32string unicode_codepoints = conv_utf8_utf32.from_bytes(utf8_bytes);

    // print the code points in %hex format
    for (char32_t chr: unicode_codepoints) {
        // cast so the argument type matches what %X expects
        printf("%%%02X", static_cast<unsigned>(chr));
    }   
    printf("\n");


    return 0;
}
