c++ making a unicode char from a string

Question

I have a string like this:

string s = "0081";

and I need to make a one-character string like this:

string c = "\u0081";

How can I make that string of length 1 from the original string of length 4?

EDIT: My mistake, "\u0081" is not a single char (1 byte) but a 2-byte character/string? What I actually have as input is a binary value, 1000 0001, which is 0x81, and that is where my string "0081" comes from. Would it be easier to go from this 0x81 directly to a string c = "\u0081", whatever that value is? Thanks for all the help.

Have you tried to get it done? How did you fail? And are you sure you only want codepoints smaller than `0x10000`? – Deduplicator Jan 19 '15 at 20:45
If you write something like string c = "\u"+"0081"; you get an error saying that \u is an incomplete universal character name. And since c is a string of 1 character, if you try something like c.replace(0,1,"9"); you just replace everything, so you no longer have \uXXXX but just "9". I can't work out how to define a one-character string (\u0081) from the 4-character string "0081". – mlf Jan 19 '15 at 20:46
You don't; you get an error: cannot add two pointers to `const char`. – Deduplicator Jan 19 '15 at 20:54
Have you tried `string c = "\u0081"`? I think you'll find that it's *not* a 1 character string. E.g. http://ideone.com/Ok7wnl – Mark Ransom Jan 19 '15 at 21:15
@MarkRansom: Depends on which definition of "character" you use. Which makes Unicode such fun. – Deduplicator Jan 19 '15 at 21:20
Yes, I see that it's not a 1-character string. Unfortunately I cannot just hardcode it: the string is only defined at execution time. So I really need to be able to build a string from "\u" combined with some "xxxx" value that is only known later. – mlf Jan 19 '15 at 21:21
@Axalo, I'm trying to make it work. The result is passed to another function that only takes a char* as its argument, so I'm going from wstring to string to char*. – mlf Jan 19 '15 at 21:27
@mlf in that case you probably want to [convert the wide string to a multibyte string](http://en.cppreference.com/w/cpp/string/multibyte/wcstombs) – Axalo Jan 19 '15 at 21:30
"\u0081" doesn't seem to be the same as L"\u0081", and I really need "\u0081", otherwise it's not working, unfortunately. – mlf Jan 19 '15 at 21:31
My answer can get you from `0x81` to a string, just skip the `strtol` step and call `CodepointToUTF8(0x81)`. – Mark Ransom Jan 19 '15 at 22:15

Axalo · Answer 1 · 2015-01-19T21:04:02.940

0

Here you go:

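// Requires <cassert>, <sstream> and <string>
unsigned int x;
std::stringstream ss;
ss << std::hex << "1081";   // std::hex makes the extraction below parse hexadecimal
ss >> x;                    // x == 0x1081

wchar_t wc1 = static_cast<wchar_t>(x);
wchar_t wc2 = L'\u1081';

assert(wc1 == wc2);

std::wstring ws(1, wc1);    // wide string holding the single character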

edited Jan 19 '15 at 21:04

answered Jan 19 '15 at 20:53


He wants UTF-8, so no banana. Anyway, are you sure he does not want full unicode codepoints? – Deduplicator Jan 19 '15 at 20:55
@Deduplicator idk, I just showed him how to "make that string of length 1 from the original string of length 4" – Axalo Jan 19 '15 at 20:59
You didn't show him how to make a string of length 1. You make a wchar_t, which is different even from a wstring. You can easily create a `wstring` from a `wchar_t`. It would be easier to use `strtol` than to play games with std::stringstream: `std::wstring ws(1, wchar_t(strtol("1081", 0, 16)));`. However, the question was to produce a *string*, by implication in UTF-8. – rici Jan 19 '15 at 21:07
@rici: Actually explicit and not by implication, he gave the desired output. Also, a `wchar_t` on Windows is unable to hold a full codepoint anyway. – Deduplicator Jan 19 '15 at 21:09
@Deduplicator: A bit contradictory, since the length of the string created explicitly is not 1. But I take your point. – rici Jan 19 '15 at 21:10
@rici how could the length be > 1? – Axalo Jan 19 '15 at 21:10
@rici: I simply take it as the standard confusion surrounding any question about Unicode by someone not understanding the crucial distinctions. So, he probably meant 1 Unicode character (whether he thinks that's a codepoint or grapheme). – Deduplicator Jan 19 '15 at 21:11
It's UTF-8. "\u1081" is the three-byte sequence `e1 82 81`. (U+1081 MYANMAR LETTER SHAN HA, in case anyone is interested) – rici Jan 19 '15 at 21:12
@rici it's a wide string – Axalo Jan 19 '15 at 21:13
L"\u1081" is a wide string. "\u1081" is not. – rici Jan 19 '15 at 21:14
@Axalo: And OP explicitly wanted a narrow UTF-8 string. A standard string. – Deduplicator Jan 19 '15 at 21:14
well then it's just not possible without data loss – Axalo Jan 19 '15 at 21:15
Yes, I did mean 1 Unicode character, and I have to say that I don't understand much about why this is such a confusing problem. – mlf Jan 19 '15 at 21:16
@mlf see http://stackoverflow.com/questions/148403/utf8-to-from-wide-char-conversion-in-stl to get some idea of the complexity of the problem. That question actually gives you half of your answer. – Mark Ransom Jan 19 '15 at 21:31
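
The byte sequence `e1 82 81` quoted in the comments above can be reproduced by hand. What follows is a minimal sketch, assuming the standard three-byte UTF-8 layout `1110xxxx 10xxxxxx 10xxxxxx` for codepoints U+0800 through U+FFFF. The helper name `EncodeThreeByte` is made up for illustration and does not come from any answer in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one codepoint in the range U+0800..U+FFFF
// as three UTF-8 bytes (1110xxxx 10xxxxxx 10xxxxxx).
// It does not reject surrogates (U+D800..U+DFFF); it only illustrates the bit layout.
std::string EncodeThreeByte(unsigned long cp)
{
    assert(cp >= 0x800 && cp <= 0xFFFF);  // only the three-byte range is handled

    std::string out;
    out += static_cast<char>(0xE0 | ((cp >> 12) & 0x0F));  // lead byte: top 4 bits
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // continuation: middle 6 bits
    out += static_cast<char>(0x80 | (cp & 0x3F));          // continuation: low 6 bits
    return out;
}
```

For U+1081 the 16 bits split into the groups 0001, 000010 and 000001, so `EncodeThreeByte(0x1081)` should produce the bytes `e1 82 81`: a narrow string of length 3 that holds a single codepoint.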

score 0 · Accepted Answer · answered Jan 19 '15 at 21:53

Here's the whole process, based on some code I linked to in a comment elsewhere.

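// CodepointToUTF8 is defined below; declare it before this code in a real program.
string s = "0081";
long codepoint = strtol(s.c_str(), NULL, 16);  // parse the hex digits: 0x81
string c = CodepointToUTF8(codepoint);         // c holds the UTF-8 encoding of U+0081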

std::string CodepointToUTF8(long codepoint)
{
    std::string out;
    // U+0000..U+007F: one byte, 0xxxxxxx
    if (codepoint <= 0x7f)
        out.append(1, static_cast<char>(codepoint));
    else if (codepoint <= 0x7ff)
    {
        // U+0080..U+07FF: two bytes, 110xxxxx 10xxxxxx
        out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0xffff)
    {
        // U+0800..U+FFFF: three bytes, 1110xxxx 10xxxxxx 10xxxxxx
        out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else
    {
        // U+10000 and above: four bytes, 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    return out;
}

Note that this code doesn't do any error checking, so if you pass it an invalid codepoint you'll get back an invalid string.
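
If error checking is wanted, it can be layered on top. The following is only a sketch, assuming that invalid input should raise `std::invalid_argument`. The names `ParseHexCodepoint` and `CheckedCodepointToUTF8` are hypothetical and not part of the answer above. It rejects non-hex text, values above U+10FFFF, and the surrogate range U+D800..U+DFFF, none of which have a valid UTF-8 encoding.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: parses text such as "0081" as a hexadecimal number.
// Throws std::invalid_argument if the text is empty, contains a non-hex
// character (including an "0x" prefix), or does not fit in an unsigned long.
unsigned long ParseHexCodepoint(const std::string& hex)
{
    if (hex.empty())
        throw std::invalid_argument("empty codepoint string");
    for (std::string::size_type i = 0; i < hex.size(); ++i)
        if (!std::isxdigit(static_cast<unsigned char>(hex[i])))
            throw std::invalid_argument("not a hexadecimal number: " + hex);

    errno = 0;
    unsigned long value = std::strtoul(hex.c_str(), NULL, 16);
    if (errno == ERANGE)
        throw std::invalid_argument("codepoint out of range: " + hex);
    return value;
}

// Hypothetical helper: same bit layout as CodepointToUTF8, but refuses
// values that are not Unicode scalar values.
std::string CheckedCodepointToUTF8(unsigned long cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    assert(out.size() >= 1 && out.size() <= 4);  // UTF-8 never needs more than 4 bytes
    return out;
}
```

With these helpers, `CheckedCodepointToUTF8(ParseHexCodepoint("0081"))` should yield the two bytes `c2 81`, and calling `.c_str()` on the result gives a `const char*` for a function that expects a C string.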

Okay wow, this is actually working pretty well. May I ask what I have to learn to understand why it works? Thanks a lot for the help! – mlf Jan 19 '15 at 22:17
@mlf you need to understand how UTF-8 is put together: http://en.wikipedia.org/wiki/UTF-8#Description – Mark Ransom Jan 19 '15 at 22:21
