How can I convert string like "\u94b1" to one real character in C++?

Question

We know that in a string literal, "\u94b1" is converted to a single character, in this case the Chinese character '钱'. But if the string literally contains the 6 characters '\', 'u', '9', '4', 'b', '1', how can I convert it to that character manually?

For example:

string s1;
string s2 = "\u94b1";
cin >> s1;            //here I input \u94b1
cout << s1 << endl;   //here output \u94b1
cout << s2 << endl;   //and here output 钱

I want to convert s1 so that cout << s1 << endl; will also output 钱.

Any suggestions, please?

Possible duplicate of http://stackoverflow.com/questions/3147900/how-to-read-file-which-contains-uxxxx-in-vc — kennytm, Jun 01 '16 at 07:19

score 4 · Accepted Answer · answered Jun 01 '16 at 10:27

In fact the conversion is a little more complicated.

string s2 = "\u94b1";

is in fact the equivalent of:

char cs2[] = { '\xe9', '\x92', '\xb1', 0 }; string s2 = cs2;

That means that you are initializing it with the 3 bytes that make up the UTF-8 representation of 钱 (assuming the compiler's execution character set is UTF-8). You can examine s2.c_str() to confirm this.
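To check the byte-level claim above, here is a minimal sketch of a byte dump. The helper name `to_hex_bytes` is hypothetical and not part of the answer; it only formats whatever bytes the string already holds.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not from the answer): formats each byte of a
// std::string as two lowercase hex digits, separated by single spaces.
std::string to_hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling `to_hex_bytes(s2)` should give `e9 92 b1` when the literal was compiled with a UTF-8 execution character set; with another execution character set the bytes will differ.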

So to process the 6 raw characters '\', 'u', '9', '4', 'b', '1', you must first extract the code point from s1, which after reading holds the same contents as string s1 = "\\u94b1";. That is easy: skip the first two characters and read the rest as hexadecimal:

unsigned int ui;
std::istringstream is(s1.c_str() + 2);
is >> std::hex >> ui;  // std::istringstream needs <sstream>

ui is now 0x94b1.
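The stream-based extraction above accepts any input without complaint. If the input may be malformed, a stricter variant could look like the following sketch. The function name `parse_u_escape` is hypothetical, and it assumes the whole string is exactly one `\uXXXX` escape.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: accepts exactly a backslash, 'u', and four hex digits.
// On success, stores the parsed value in `out` and returns true.
bool parse_u_escape(const std::string& s, unsigned& out) {
    if (s.size() != 6 || s[0] != '\\' || s[1] != 'u') return false;
    unsigned value = 0;
    for (std::size_t i = 2; i < 6; ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (!std::isxdigit(c)) return false;
        // Map '0'..'9' to 0..9 and 'a'..'f' / 'A'..'F' to 10..15.
        unsigned digit = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
        value = value * 16 + digit;
    }
    out = value;
    return true;
}
```

Rejecting bad input here keeps later steps from encoding a garbage value.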

Now, provided you have a C++11-compliant system, you can convert it with std::codecvt_utf8 (declared in <codecvt>):

wchar_t wc = ui;
std::codecvt_utf8<wchar_t> conv;
const wchar_t *wnext;
char *next;
char cbuf[4] = {0}; // initialize the buffer to 0 to have a terminating null
std::mbstate_t state = std::mbstate_t(); // zero-initialize the conversion state
conv.out(state, &wc, &wc + 1, wnext, cbuf, cbuf+4, next);

cbuf now contains the 3 bytes representing 钱 in UTF-8 and a terminating null, and you can finally do:

string s3 = cbuf;
cout << s3 << endl;
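Since `std::codecvt_utf8` has been deprecated since C++17, the same step can also be done by hand. The following is a minimal sketch, not the answer's method. The name `encode_utf8_bmp` is hypothetical, and it assumes the code point fits in 16 bits, as a single `\uXXXX` escape always does.

```cpp
#include <string>

// Hypothetical helper: encodes one code point in the range U+0000..U+FFFF
// as 1 to 3 UTF-8 bytes. Surrogate values (U+D800..U+DFFF) are not rejected.
std::string encode_utf8_bmp(unsigned cp) {
    std::string out;
    if (cp < 0x80) {
        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | ((cp >> 12) & 0x0F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For 0x94b1 this produces the bytes 0xe9, 0x92, 0xb1, the same three bytes described for s2.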

Thanks for the `stringstream` approach. I wrote a function to translate all `\uxxxx` sequences to UTF-8 characters — Eric Zheng, Jun 02 '16 at 06:08

score 2 · Answer 2 · answered Jun 01 '16 at 09:10

You do this by writing code that checks whether the string contains a backslash, a letter u, and four hexadecimal digits, and converts this to a Unicode code point. Then your std::string implementation probably assumes UTF-8, so you translate that code point into 1, 2, or 3 UTF-8 bytes.

For extra points, figure out how to enter code points outside the basic plane.
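One possible reading of the approach described above, including the "extra points" part, is sketched below. It assumes that code points outside the basic plane are written as two consecutive escapes forming a UTF-16 surrogate pair (as JSON and Java do); this answer does not say so, so treat that as an assumption. All function names are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Reads "\uXXXX" starting at s[pos]; returns false if it is not there.
bool read_escape(const std::string& s, std::size_t pos, unsigned& out) {
    if (pos + 6 > s.size() || s[pos] != '\\' || s[pos + 1] != 'u') return false;
    unsigned value = 0;
    for (std::size_t i = pos + 2; i < pos + 6; ++i) {
        char c = s[i];
        unsigned digit;
        if (c >= '0' && c <= '9') digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else return false;
        value = value * 16 + digit;
    }
    out = value;
    return true;
}

// Appends the UTF-8 encoding (1 to 4 bytes) of cp to out.
void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Replaces every well-formed \uXXXX escape in `in` with its UTF-8 bytes.
std::string unescape_u(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned hi = 0;
        if (!read_escape(in, i, hi)) {
            out += in[i++];            // ordinary character: copy through
            continue;
        }
        unsigned lo = 0;
        bool is_high = hi >= 0xD800 && hi <= 0xDBFF;
        if (is_high && read_escape(in, i + 6, lo) && lo >= 0xDC00 && lo <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            append_utf8(out, 0x10000UL + ((hi - 0xD800UL) << 10) + (lo - 0xDC00UL));
            i += 12;
        } else if (hi >= 0xD800 && hi <= 0xDFFF) {
            out.append(in, i, 6);      // lone surrogate: leave the escape as text
            i += 6;
        } else {
            append_utf8(out, hi);
            i += 6;
        }
    }
    return out;
}
```

Leaving malformed or unpaired escapes untouched is a design choice of this sketch; throwing or substituting U+FFFD would be equally defensible.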

gnasher729


Does `std::string` even assume an encoding? I always thought it was a dumb container of characters which may be bytes, code units, code points or whatever, depending on the implementation and no part of `std::string` supports anything like working with text (e.g. Unicode normalization, language-aware ordering, etc.). You get an array of things. How that maps to text is not C++'s job. – Joey Jun 01 '16 at 09:15
@Joey "How that maps to text is not C++'s job". Not quite. std::string may not assume an encoding but other parts of C++ definitely do. If it has to do with locales then it probably has some idea about one or more encodings. – n. m. could be an AI Jun 01 '16 at 09:26
Thanks for the inspiration! – Eric Zheng Jun 02 '16 at 06:11

score 1 · Answer 3 · answered Jun 01 '16 at 19:28

With utfcpp (header only) you may do:

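#include <utf8.h>    // utfcpp; adjust the include path to your installation

#include <cstdint>
#include <cstdlib>   // std::strtoul
#include <iostream>
#include <iterator>  // std::back_inserter
#include <string>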

std::string replace_utf8_escape_sequences(const std::string& str) {
    std::string result;
    std::string::size_type first = 0;
    std::string::size_type last = 0;
    while(true) {
        // Find an escape position
        last = str.find("\\u", last);
        if(last == std::string::npos) {
            result.append(str.begin() + first, str.end());
            break;
        }

        // Extract a 4 digit hexadecimal
        const char* hex = str.data() + last + 2;
        char* hex_end;
        std::uint_fast32_t code = std::strtoul(hex, &hex_end, 16);
        std::string::size_type hex_size = hex_end - hex;

        // Append the leading and converted string
        if(hex_size != 4) last = last + 2 + hex_size;
        else {
            result.append(str.begin() + first, str.begin() + last);
            try {
                utf8::utf16to8(&code, &code + 1, std::back_inserter(result));
            }
            catch(const utf8::exception&) {
                // Error Handling
                result.clear();
                break;
            }
            first = last = last + 2 + 4;
        }
    }
    return result;
}

int main()
{
    std::string source = "What is the meaning of '\\u94b1'  '\\u94b1' '\\u94b1' '\\u94b1' ?";
    std::string target = replace_utf8_escape_sequences(source);
    std::cout << "Conversion from \"" << source << "\" to \"" << target << "\"\n";
}

Very helpful! I looked into utfcpp and made my function do what `utf16to8` does: translate a code point into several bytes, which are appended to the destination string. My work is pretty much the same as yours. Anyway, thanks a lot. — Eric Zheng, Jun 02 '16 at 06:06
