1

So I wanted to try converting a Unicode character to an integer for a project of mine. I tried something like this:

 unsigned int foo = (unsigned int)L'آ'; 
 std::cout << foo << std::endl;

How do I convert it back? Or in other words, how do I convert an int to the respective Unicode character?

EDIT: I am expecting the output to be the Unicode character corresponding to an integer, for example:

cout << (wchar_t)1570; // This should print the character whose Unicode value is 1570 (which is: آ)

I am using Visual Studio 2013 Community with its default compiler, on Windows 10 Pro 64-bit.

Cheers

Samuel
  • Cast it to `wchar_t` maybe? – Vilx- Jun 04 '19 at 12:51
  • Please note that `wchar_t` and `L'آ'` have a different size on Windows (2 bytes) than on Linux/Unix/BSD systems (4 bytes). So converting a Unicode integer to `wchar_t` doesn't have to be a simple cast! – Marek R Jun 04 '19 at 12:53
  • That depends on what encoding you want to use. – Fureeish Jun 04 '19 at 12:54
    Not clear what you mean. Characters are integer types in C++. It would improve the question to explain what output behaviour you are trying to produce – M.M Jun 04 '19 at 12:55
    Also useful to mention the compiler, OS and runtime environment, as a lot of unicode stuff is implementation-defined and relies on OS or shell support etc. – M.M Jun 04 '19 at 12:56
  • I am using Visual Studio 2013 Community and Windows 10 Pro 64-bit – Samuel Jun 04 '19 at 13:15
  • "*This should print the unicode value of 1570*" - to my knowledge, this is impossible, unless you modify the encoding of your output device (most probably a console). Unicode defines code points. You need encoding to actually store the values using bytes. – Fureeish Jun 04 '19 at 13:17
  • You must first be sure that your console is actually capable of printing these. What happens if you just do `std::wcout << L'آ';`? (BTW, use `wcout` if you want to print wide chars) – Not a real meerkat Jun 04 '19 at 13:22
  • https://stackoverflow.com/q/12055197/560648 – Lightness Races in Orbit Jun 04 '19 at 13:39
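
Putting the comment suggestions together: one known MSVC technique (not spelled out verbatim in the comments) is to switch stdout to UTF-16 mode with _setmode before using std::wcout. A minimal sketch, assuming Windows, the Visual Studio CRT, and a console font that can render the character:

#include <fcntl.h>   // _O_U16TEXT
#include <io.h>      // _setmode
#include <cstdio>    // _fileno, stdout
#include <iostream>

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);  // stdout now takes UTF-16 output
    std::wcout << (wchar_t)1570 << L'\n';   // should print آ
    return 0;
}

Note that after the _setmode call, narrow output such as printf or std::cout on the same stream will fail, so stick to the wide APIs.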

2 Answers

3

L'آ' will work okay as a single wide character, because its code point is below 0xFFFF. But in general UTF-16 uses surrogate pairs, so a Unicode code point cannot always be represented with a single wide character. You need a wide string instead.

Your problem is also partly to do with printing a UTF-16 character in the Windows console. If you use MessageBoxW to view a wide string, it will work as expected:

#include <windows.h>

wchar_t buf[2] = { 0 };   // one character plus a null terminator
buf[0] = 1570;            // آ
MessageBoxW(0, buf, 0, 0);

However, in general you need a wide string to account for surrogate pairs, not a single wide char. Example:

#include <string>
#include <windows.h>

int utf32 = 1570;

const int mask = (1 << 10) - 1;    // low 10 bits of the code point
std::wstring str;
if (utf32 < 0x10000)               // BMP code point: one UTF-16 code unit
{
    str.push_back((wchar_t)utf32);
}
else                               // otherwise encode as a surrogate pair
{
    utf32 -= 0x10000;
    int hi = (utf32 >> 10) & mask; // top 10 bits
    int lo = utf32 & mask;         // bottom 10 bits

    hi += 0xD800;                  // high (lead) surrogate
    lo += 0xDC00;                  // low (trail) surrogate

    str.push_back((wchar_t)hi);
    str.push_back((wchar_t)lo);
}

MessageBoxW(0, str.c_str(), 0, 0);

See related posts for printing UTF-16 in the Windows console.
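
For the opposite direction (recovering the integer code point from a wide string), here is a minimal sketch. It assumes Windows-sized 16-bit wchar_t and well-formed UTF-16 input; the helper name first_code_point is mine, not a library function:

#include <string>

// Decode the first code point of a well-formed UTF-16 wide string.
int first_code_point(const std::wstring& s)
{
    wchar_t hi = s[0];
    if (hi >= 0xD800 && hi <= 0xDBFF)   // high surrogate: a pair follows
    {
        wchar_t lo = s[1];              // low surrogate
        return 0x10000 + (((hi - 0xD800) << 10) | (lo - 0xDC00));
    }
    return hi;                          // BMP character: the value is the code point
}

This simply inverts the encoding above: subtract the surrogate bases, reassemble the two 10-bit halves, and add back the 0x10000 offset.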

Barmak Shemirani
1

The key here is `setlocale(LC_ALL, "en_US.UTF-8");`. en_US is the localization string, which you may want to set to a different value, such as zh_CN for Chinese.

#include <cstdio>
#include <clocale>   // setlocale
#include <cwchar>    // wprintf

int main() {
    setlocale(LC_ALL, "en_US.UTF-8");
    // This does not work without setlocale(LC_ALL, "en_US.UTF-8");
    for (int ch = 30000; ch < 30030; ch++) {
        wprintf(L"%lc", ch);
    }
    wprintf(L"\n");  // keep the stream wide-oriented; mixing in printf here is undefined
    return 0;
}

Things to notice here are the use of wprintf and the format string L"%lc": the L prefix makes the format string itself wide, and %lc tells wprintf to print the argument as a wide character.

If you want to use this method to print some variables, use the type wchar_t.
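
For example, a small sketch (mine, not the answer's) printing a wchar_t variable under the same locale setup:

#include <clocale>
#include <cwchar>

int main() {
    setlocale(LC_ALL, "en_US.UTF-8");
    wchar_t ch = 1570;        // the code point from the question
    wprintf(L"%lc\n", ch);    // prints آ if the locale and terminal allow it
    return 0;
}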


Steak Overflow