
I have a string like this:

string s = "0081";

and I need to make a one-character string like this:

string c = "\u0081";

How can I make that string of length 1 from the original string of length 4?

EDIT: My mistake, "\u0081" is not a char (1 byte) but a 2-byte character/string? What I actually have as input is the binary value 1000 0001, which is 0x81, and that is where my string "0081" comes from. Would it be easier to go from 0x81 directly to a string c = "\u0081", whatever that value is? Thanks for all the help.

mlf
  • Have you tried to get it done? How did you fail? And are you sure you only want codepoints smaller than `0x10000`? – Deduplicator Jan 19 '15 at 20:45
  • If you write something like string c = "\u"+"0081"; you get an error that this is an incomplete universal character name \u. Since c is a string of 1 character, if you try something like c.replace(0,1,"9"); you just replace everything and you don't have \uXXXX anymore, just "9". I can't get from the 4-character string "0081" to a one-character string (\u0081). – mlf Jan 19 '15 at 20:46
  • You don't, you get an error, cannot add two pointers to `const char`. – Deduplicator Jan 19 '15 at 20:54
  • Have you tried `string c = "\u0081"`? I think you'll find that it's *not* a 1 character string. E.g. http://ideone.com/Ok7wnl – Mark Ransom Jan 19 '15 at 21:15
  • @MarkRansom: Depends on which definition of "character" you use. Which makes unicode such fun. – Deduplicator Jan 19 '15 at 21:20
  • Yes, I see that it's not a 1 character string. Unfortunately I cannot just hardcode it; the string is defined only at execution time. So I really need to be able to build a string from "\u" combined with some "xxxx" value known later. – mlf Jan 19 '15 at 21:21
  • @mlf does my solution work for you? – Axalo Jan 19 '15 at 21:24
  • @Axalo, I'm trying to make it work. The result is passed to another function that only takes a char* as argument, so I'm going from wstring to string to char*. – mlf Jan 19 '15 at 21:27
  • @mlf in that case you probably want to [convert the wide string to a multibyte string](http://en.cppreference.com/w/cpp/string/multibyte/wcstombs) – Axalo Jan 19 '15 at 21:30
  • "\u0081" doesn't seem to be the same as L"\u0081", and I really need "\u0081", otherwise it's not working, unfortunately. – mlf Jan 19 '15 at 21:31
  • My answer can get you from `0x81` to a string, just skip the `strtol` step and call `CodepointToUTF8(0x81)`. – Mark Ransom Jan 19 '15 at 22:15

2 Answers


Here you go:

unsigned int x;
std::stringstream ss;
ss << std::hex << "1081";
ss >> x;  // parse the hex digits, so x == 0x1081

wchar_t wc1 = static_cast<wchar_t>(x);
wchar_t wc2 = L'\u1081';

assert(wc1 == wc2);

std::wstring ws(1, wc1);  // wide string holding one wchar_t
Axalo
  • He wants UTF-8, so no banana. Anyway, are you sure he does not want full unicode codepoints? – Deduplicator Jan 19 '15 at 20:55
  • @Deduplicator idk, I just showed him how to "make that string of length 1 from the original string of length 4" – Axalo Jan 19 '15 at 20:59
  • You didn't show him how to make a string of length 1. You make a wchar_t, which is different even from a wstring. You can easily create a `wstring` from a `wchar_t`. It would be easier to use `strtol` than to play games with std::stringstream: `std::wstring ws(1, wchar_t(strtol("1081", 0, 16)))`. However, the question was to produce a *string*, by implication in UTF-8. – rici Jan 19 '15 at 21:07
  • @rici: Actually explicit and not by implication, he gave the desired output. Also, a `wchar_t` on windows is unable to hold a full codepoint anyway. – Deduplicator Jan 19 '15 at 21:09
  • @Deduplicator: A bit contradictory, since the length of the string created explicitly is not 1. But I take your point. – rici Jan 19 '15 at 21:10
  • @rici how could the length be > 1? – Axalo Jan 19 '15 at 21:10
  • @rici: I simply take it as the standard confusion surrounding any question about unicode by someone not understanding the crucial distinctions. So, he probably meant 1 unicode character (whether he thinks that's a codepoint or grapheme). – Deduplicator Jan 19 '15 at 21:11
  • It's UTF-8. "\u1081" is the three-byte sequence `e1 82 81`. (U+1081 MYANMAR LETTER SHAN HA, in case anyone is interested) – rici Jan 19 '15 at 21:12
  • @rici it's a wide string – Axalo Jan 19 '15 at 21:13
  • L"\u1081" is a wide string. "\u1081" is not. – rici Jan 19 '15 at 21:14
  • @Axalo: And OP explicitly wanted a narrow UTF-8 string. A standard string. – Deduplicator Jan 19 '15 at 21:14
  • well then it's just not possible without data loss – Axalo Jan 19 '15 at 21:15
  • Yes, I did mean 1 unicode character, and I have to say that I don't understand much about why this is such a confusing problem. – mlf Jan 19 '15 at 21:16
  • @mlf see http://stackoverflow.com/questions/148403/utf8-to-from-wide-char-conversion-in-stl to get some idea of the complexity of the problem. That question actually gives you half of your answer. – Mark Ransom Jan 19 '15 at 21:31

Here's the whole process, based on some code I linked to in a comment elsewhere.

string s = "0081";
long codepoint = strtol(s.c_str(), NULL, 16);
string c = CodepointToUTF8(codepoint);

std::string CodepointToUTF8(long codepoint)
{
    std::string out;
    if (codepoint <= 0x7f)
        out.append(1, static_cast<char>(codepoint));
    else if (codepoint <= 0x7ff)
    {
        out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0xffff)
    {
        out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else
    {
        out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    return out;
}

Note that this code doesn't do any error checking, so if you pass it an invalid codepoint you'll get back an invalid string.
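
The missing error checking could be added in front of the call to `CodepointToUTF8`. What follows is only a minimal sketch and not part of the original answer: `IsValidCodepoint` and `ParseHexCodepoint` are hypothetical helper names, and returning -1 for bad input is an assumed convention. It rejects values outside the Unicode range (above 0x10ffff or negative) and the UTF-16 surrogate range, neither of which can be encoded as valid UTF-8.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: true if cp is a code point that valid UTF-8 can encode.
bool IsValidCodepoint(long cp)
{
    if (cp < 0 || cp > 0x10ffff)
        return false;                 // outside the Unicode range
    if (cp >= 0xd800 && cp <= 0xdfff)
        return false;                 // UTF-16 surrogates are not characters
    return true;
}

// Hypothetical helper: parses a hex string such as "0081".
// Returns the code point, or -1 (assumed convention) if the input is unusable.
long ParseHexCodepoint(const std::string& s)
{
    if (s.empty())
        return -1;
    char* end = nullptr;
    long cp = std::strtol(s.c_str(), &end, 16);
    assert(end != nullptr);           // strtol always sets end when given a pointer
    if (*end != '\0')
        return -1;                    // trailing characters that are not hex digits
    return IsValidCodepoint(cp) ? cp : -1;
}
```

With helpers like these, the caller would pass the result to `CodepointToUTF8` only when `ParseHexCodepoint(s)` returns a non-negative value.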

Mark Ransom
  • Okay wow, this is working pretty well actually. May I ask what I have to learn to understand why this works? Thanks a lot for the help! – mlf Jan 19 '15 at 22:17
  • @mlf you need to understand how UTF-8 is put together: http://en.wikipedia.org/wiki/UTF-8#Description – Mark Ransom Jan 19 '15 at 22:21
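
As a worked example of the UTF-8 layout for the value in the question: 0x81 is binary 1000 0001, which needs more than 7 bits, so it falls in the two-byte range 0x80 to 0x7ff. The two-byte form is 110xxxxx 10xxxxxx, where the first byte carries the top 5 bits and the second byte the low 6 bits. For 0x81 the top bits are 0x81 >> 6 = 0x02 and the low bits are 0x81 & 0x3f = 0x01, giving the bytes 0xc2 0x81. The sketch below isolates just that branch; `EncodeTwoByte` is a hypothetical name used for illustration only.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes a code point in the range 0x80..0x7ff
// as the two UTF-8 bytes 110xxxxx 10xxxxxx.
std::string EncodeTwoByte(unsigned cp)
{
    assert(cp >= 0x80 && cp <= 0x7ff);             // only the two-byte range
    std::string out;
    out += static_cast<char>(0xc0 | (cp >> 6));    // lead byte: top 5 bits
    out += static_cast<char>(0x80 | (cp & 0x3f));  // continuation: low 6 bits
    return out;
}
```

So `EncodeTwoByte(0x81)` yields a `std::string` of length 2, which is why "\u0081" in a UTF-8 narrow string is not a single `char`.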