
I just heard about the existence of char8_t, char16_t and char32_t and I am testing them out. When I try to compile the code below, g++ throws the following error:

error: use of deleted function ‘std::basic_ostream<char, _Traits>& std::operator<<(basic_ostream<char, _Traits>&, char32_t) [with _Traits = char_traits<char>]’
    6 |         std::cout << U'😋' << std::endl;
      |                      ^~~~~
#include <iostream>

int main() {
  char32_t c = U'😋';

  std::cout << c << std::endl;

  return 0;
}

Additionally, why can't I put the emoji into a char8_t or char16_t? For example, the following lines of code don't work:

char16_t c1 = u'😋';
char8_t c2 = u8'😋';
auto c3 = u'😋';
auto c4 = u8'😋';

From my understanding, emojis are UTF-8 characters and should therefore fit into a char8_t.

– Sheldon
  • Characters encoded in UTF-8 can be more than one byte, and that's definitely the case for emojis. – Kevin Feb 26 '23 at 21:58
  • This is just a problem of encoding. Which compiler are you using, and on what platform? – Marek R Feb 26 '23 at 22:00
  • Use `char const* c = "😋";` – chrysante Feb 26 '23 at 22:02
  • Here is a similar problem where I explain how to handle this on MSVC: https://stackoverflow.com/a/67819605/1387438 Note that if you are using Windows and MinGW, then locale support is poor. On other platforms the implicit use of UTF-8 should make this work quite easily. – Marek R Feb 26 '23 at 22:03
  • There's not enough space in a `uint8_t` to contain all the ASCII characters and all the emojis. You'll need a data structure with more space. – Thomas Matthews Feb 26 '23 at 22:08

4 Answers


emojis are UTF-8 characters

There is no such thing as a "UTF-8 character".

There are Unicode codepoints. These can be represented in the UTF-8 encoding, such that each codepoint maps to a sequence of one or more UTF-8 code units: char8_ts. But that means that most codepoints map to multiple char8_ts: AKA, a string. And emojis are not among the 128 codepoints that map to a single UTF-8 code unit.

Emoji in particular can be built out of multiple codepoints, so even using UTF-32, you cannot guarantee that any given emoji fits in a single char32_t.

It's best to treat these things as strings, not characters, at all times. Forget that "characters" even exist.
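
A short sketch of both points, assuming a C++20 compiler and a UTF-8-encoded source file:

#include <iostream>
#include <string_view>

int main() {
  // U+1F60B encodes to four UTF-8 code units, so it is a string, not a single char8_t.
  std::u8string_view smile = u8"😋";
  std::cout << smile.size() << '\n';  // prints 4

  // Some emoji span several codepoints: the "family" emoji below is five
  // codepoints (man, ZWJ, woman, ZWJ, girl), so it does not fit in one char32_t.
  std::u32string_view family = U"👨‍👩‍👧";
  std::cout << family.size() << '\n'; // prints 5

  return 0;
}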

– Nicol Bolas

Code

#include <iostream>

#ifdef _WIN32 
#include <Windows.h>
#define SET_CONSOLE_UTF8 SetConsoleCP(CP_UTF8); SetConsoleOutputCP(CP_UTF8); // Set console input and output code pages to UTF-8. Visual C++ on Windows.
#endif // _WIN32 


#if defined(__cpp_char8_t) || defined(__cpp_lib_char8_t)

//Operator <<
std::ostream& operator<<(std::ostream& os, const std::u8string& str)
{
    os << reinterpret_cast<const char*>(str.data());
    return os;
}

//Convert u8string to string.
std::string ToString(const std::u8string& s) {
    return std::string(s.begin(), s.end());
}

std::u8string Tou8String(const std::string& s) {
    return std::u8string(s.begin(), s.end());
}

//const char8_t* literal to string. Operator ""_s
static inline std::string operator"" _s(const char8_t* value, size_t size) {
    return std::string(reinterpret_cast<const char*>(value), size);
}

#endif


using namespace std::string_literals;// operator ""s

int main() {
#ifdef _WIN32
    SET_CONSOLE_UTF8
#endif

    std::u8string u8String = u8"😋"s; // u8string literal.
    std::string str = u8"😋"_s; // Operator ""_s: convert a UTF-8 literal (const char8_t*) to std::string.

    std::cout << "string              " << str << std::endl; //Using operator << for std::string
    std::cout << "u8string -> string  " << ToString(u8String) << std::endl; //Using function ToString(u8string) -> string
    std::cout << "u8string            " << u8String << std::endl; //Using operator << for std::u8string.
    std::cout << "string -> u8string  " << Tou8String(str) << std::endl; //Using function Tou8String(string) -> u8string

    std::cin.get();
    return 0;
}

Output on Windows Terminal and on https://godbolt.org/ (Clang and GCC):

string              😋
u8string -> string  😋
u8string            😋
string -> u8string  😋

(Screenshots of the same output from Visual C++, Clang on godbolt, and GCC were attached here.)

– Joma (edited by Remy Lebeau)
  • https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question Posting screenshots of godbolt pages is bad practice. You should share a link to godbolt, then anyone can play with it. Note that godbolt implicitly uses UTF-8 encoding; that doesn't have to be the default on real systems. – Marek R Feb 27 '23 at 13:08
  • I just updated the answer. Thanks for your recommendation. – Joma Feb 27 '23 at 15:02

When I try to compile the code below, g++ throws the following error:

The encoding expected by the narrow and wide standard streams is implementation-dependent, and may also depend on what the terminal you are ultimately printing to expects. You need to convert your character to the correct encoding, as char or wchar_t data, if you want to print to std::cout or std::wcout respectively.
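
For example, on a terminal that expects UTF-8 you can convert the codepoint by hand before printing. A minimal sketch; ToUtf8 is an illustrative helper, not a standard function:

#include <iostream>
#include <string>

// Encode one Unicode codepoint as UTF-8 (assumes the codepoint is valid).
std::string ToUtf8(char32_t cp) {
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

int main() {
  char32_t c = U'😋';
  std::cout << ToUtf8(c) << std::endl; // prints 😋 on a UTF-8 terminal
  return 0;
}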

Additionally, why can't I put the emoji into a char8_t or char16_t? For example, the following lines of code don't work:

The 😋 emoji is Unicode code point U+1F60B, which in both the UTF-8 and UTF-16 encodings requires multiple code units. But you are trying to form a character literal, which holds only one code unit.
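
For illustration (assuming a UTF-8-encoded source file): string literals can hold multiple code units, so these forms compile, while only the UTF-32 character literal fits in a single code unit:

#include <cstdint>
#include <iostream>
#include <string>

int main() {
  std::u8string s8 = u8"😋";  // 4 UTF-8 code units
  std::u16string s16 = u"😋"; // 2 UTF-16 code units (a surrogate pair)
  char32_t c32 = U'😋';       // OK: a single UTF-32 code unit
  std::cout << s8.size() << ' ' << s16.size() << ' '
            << static_cast<std::uint32_t>(c32) << '\n'; // prints "4 2 128523"
  return 0;
}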

From my understanding, emojis are UTF-8 characters [...]

That doesn't make sense. UTF-8 is an encoding for Unicode code points; it doesn't make sense to say that a character "is UTF-8". This suggests you may have some fundamental misunderstandings of how Unicode and character/string encodings in general work. I would suggest reading an introduction to the topic.

– user17732522

This works:

#include <iostream>

int main() {
  const char* c = "";

  std::cout << c << std::endl;

  return 0;
}

Explanation.

  1. 😋 is a multibyte sequence and does not fit in a single char. Thus const char* should be used.
  2. The default source file encoding is UTF-8, thus Unicode characters can be used only in UTF-8 encoding. As a char32_t literal it should be written as U'\x1F60B'.
  3. operator<<(std::basic_ostream) is deleted for char8_t, char16_t and char32_t since C++20, as the sketch below shows.
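
A minimal sketch of points 2 and 3; it prints the numeric codepoint value rather than the glyph:

#include <cstdint>
#include <iostream>

int main() {
  char32_t c = U'\x1F60B';  // the same codepoint written as an escape
  // std::cout << c;        // error: operator<< is deleted for char32_t since C++20
  std::cout << static_cast<std::uint32_t>(c) << std::endl; // prints 128523

  return 0;
}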
– 273K
  • On Windows it will not work out of the box, since the system usually uses country-specific one-byte encodings, which have no full Unicode support. – Marek R Feb 26 '23 at 22:06
  • "*The default source file encoding is UTF-8*" For which compiler? – Nicol Bolas Feb 26 '23 at 22:07
  • @NicolBolas On Linux, gcc and clang use UTF-8 implicitly. – Marek R Feb 26 '23 at 22:08
  • @MarekR Just tried in VS and it offered to save a file in UTF-8 with a BOM. So, this is not an issue. – 273K Feb 26 '23 at 22:11
  • auto also supports this ;) https://godbolt.org/z/97b1exsc7 – Eriss69 Feb 26 '23 at 22:57
  • @273K It depends on your machine settings. If you configured your machine to use code page 65001 (UTF-8), this will work out of the box. I'm currently using code page 437 (or 1252) and the outcome on my console is `≡ƒÿï`. I have to add `std::locale::global(std::locale(".utf-8")); std::cout.imbue(std::locale(""));` to make it work and change the console code page to something that supports this character (`chcp 65001`). – Marek R Feb 27 '23 at 09:42