Somehow I couldn't find the answer on Google. I'm probably using the wrong terminology when searching. I'm trying to perform a simple task: convert a number that represents a character to the character itself, as in this table: http://unicode-table.com/en/#0460

For example, if my number is 47 (which is '/'), I can just put 47 in a char and print it using cout, and I will see a slash in the console (there is no problem for ASCII values, i.e. numbers lower than 128).

But if my number is 1120, the character should be 'Ѡ' (Cyrillic capital letter omega). I assume it is represented by several characters (which cout would know to convert to 'Ѡ' when it prints to the screen).

How do I get these "several characters" that represent 'Ѡ'?

I have a library called ICU, and I'm using UTF-8.

nwellnhof
OopsUser
  • Some information here : http://stackoverflow.com/questions/16208079/how-to-work-with-utf-8-in-c-conversion-from-other-encodings-to-utf-8 – Holt Apr 27 '14 at 10:57
  • There's no characters that you can touch with your fingers, or nail to a wall, or store in a computer. Characters are abstract mathematical entities, just like numbers. You can think of a character, but not actually have it in a tangible form. All you can have is a *representation* of a character. The "unicode number" you already have is a perfectly good representation of a character. If you need some other representation, you need to know which one. – n. m. could be an AI Apr 27 '14 at 11:04
  • BTW: Some characters are sequences of numbers (Unicode code points), though not always. To your question: just recode from UTF-32 to UTF-8. The [tag:utf-8] tag wiki has a link to the official algorithm. – Deduplicator Apr 27 '14 at 11:06
  • (Continued) UTF-8 is another good representation. If you need that, you can use u_strFromUTF32 and u_strToUTF8 to convert from "unicode numbers" (that's UTF-32) to ICU strings and from ICU strings to UTF-8, respectively. (ICU uses UTF-16 internally, so there is no direct conversion from UTF-32 to UTF-8.) – n. m. could be an AI Apr 27 '14 at 11:18
  • n.m thanks for that last comment, i will try it :) – OopsUser Apr 27 '14 at 11:19
  • “which `cout` would know to convert to 'Ѡ'” – actually `cout` doesn’t know anything and does not convert. It just passes whatever bytes it gets through to the system. – Konrad Rudolph Apr 27 '14 at 11:58

2 Answers

What you call a Unicode number is typically called a code point. If you want to work with C++ and Unicode strings, ICU offers an icu::UnicodeString class; see the ICU API documentation for details.

To create a UnicodeString holding a single character, you can use the constructor that takes a code point in a UChar32:

icu::UnicodeString::UnicodeString(UChar32 ch)

Then you can call the toUTF8String method to convert the string to UTF-8.

Example program:

#include <iostream>
#include <string>

#include <unicode/unistr.h>

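int main() {
    // U+0460 (decimal 1120) is CYRILLIC CAPITAL LETTER OMEGA.
    // The cast selects the UChar32 (code point) constructor.
    icu::UnicodeString uni_str(static_cast<UChar32>(1120));

    // Append the UTF-8 encoding of the string to str.
    std::string str;
    uni_str.toUTF8String(str);
    std::cout << str << std::endl;

    return 0;
}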

On a Linux system like Debian, you can compile this program with:

g++ so.cc -o so -licuuc

If your terminal supports UTF-8, this will print the Cyrillic omega character 'Ѡ'.
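To see the "several characters" the question asks about, it can help to dump the bytes of the resulting UTF-8 string. The following is a minimal sketch using only the standard library; ToHexBytes is a hypothetical helper name, not an ICU function. For U+0460 the UTF-8 encoding is the two bytes D1 A0.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of ICU): render each byte of a string
// as two uppercase hex digits, separated by spaces.
std::string ToHexBytes(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(b));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling ToHexBytes(str) on the std::string filled by toUTF8String in the program above should show those two bytes.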

nwellnhof
  • Good example, just a couple of comments: you might use #include <unicode/unistr.h>, as that is the typical convention. Also, if you #include <unicode/ustream.h>, you can just do `std::cout << uni_str << std::endl` – Steven R. Loomis Apr 29 '14 at 00:22

Another alternative uses only standard library components. The following example stores the Unicode code point in a std::u32string and converts it to a UTF-8 encoded std::string.

Creating a std::u32string with a Unicode code point is simple:

Method 1: using brace init (calling the initializer_list constructor)

std::u32string u1{codePointNumber};
// For example:
std::u32string u1{305}; // 305 is 'ı'

Method 2: using operator +=

std::u32string u2{}; // Empty string
// For example:
u2 += 305;
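Both methods above use a literal. If the code point arrives as a runtime integer, brace initialization becomes a narrowing conversion, which compilers reject or warn about. A minimal sketch of one way around that, where FromCodePoint is a hypothetical helper name and the parameter type is an assumption:

```cpp
#include <string>

// Hypothetical helper: wrap a runtime integer code point in a std::u32string.
std::u32string FromCodePoint(unsigned long codePoint)
{
    // Convert explicitly, then use the (count, character) constructor
    // instead of brace initialization to avoid a narrowing conversion.
    return std::u32string(1, static_cast<char32_t>(codePoint));
}
```

For example, FromCodePoint(305) yields a one-element string holding U+0131. The helper does no range checking, so callers would need to make sure the value is a valid code point.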

To convert std::u32string to a std::string, you can use std::wstring_convert from the <locale> header:

#include <iostream>

#include <codecvt>
#include <string>
#include <locale>

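// Converts a UTF-32 string to a UTF-8 encoded std::string.
std::string U32ToStr(const std::u32string& str)
{
    // codecvt_utf8<char32_t> converts between UTF-32 and UTF-8.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(str);
}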

int main()
{
    std::u32string u1{305};
    std::cout << U32ToStr(u1) << "\n";
    return 0;
}

Example 1 on godbolt

Note that std::wstring_convert and std::codecvt_utf8 have been deprecated since C++17 and are removed in C++26, so you may want to use an alternative method if you are using a newer version of C++.
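One possible alternative that avoids the deprecated facilities is to encode the code point by hand. The sketch below follows the UTF-8 bit layout from RFC 3629; EncodeUtf8 is a hypothetical helper name, not a standard library function, and the choice to throw on invalid input is an assumption.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any library): encode one Unicode
// code point as a UTF-8 byte sequence.
std::string EncodeUtf8(char32_t cp)
{
    // Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not
    // valid Unicode scalar values and cannot be encoded.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    std::string out;
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this helper, std::cout << EncodeUtf8(1120) writes the two bytes 0xD1 0xA0, which a UTF-8 terminal renders as 'Ѡ'.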