
I just heard about the existence of char8_t, char16_t and char32_t and I am testing them out. When I try to compile the code below, g++ throws the following error:

error: use of deleted function ‘std::basic_ostream<char, _Traits>& std::operator<<(basic_ostream<char, _Traits>&, char32_t) [with _Traits = char_traits<char>]’
    6 |         std::cout << U'😋' << std::endl;
      |                      ^~~~~
#include <iostream>

int main() {
  char32_t c = U'😋';

  std::cout << c << std::endl;

  return 0;
}

Additionally, why can't I put the emoji into a char8_t or char16_t? For example, the following lines of code don't work:

char16_t c1 = u'😋';
char8_t c2 = u8'😋';
auto c3 = u'😋';
auto c4 = u8'😋';

From my understanding, emojis are UTF-8 characters and should therefore fit into a char8_t.

– Sheldon
  • Characters encoded in UTF-8 can be more than one byte, and that's definitely the case for emojis. – Kevin Feb 26 '23 at 21:58
  • This is just a problem of encoding. Which compiler are you using, and on what platform? – Marek R Feb 26 '23 at 22:00
  • Use `char const* c = "😋";` – chrysante Feb 26 '23 at 22:02
  • Here is a similar problem where I explain how to handle this on MSVC: https://stackoverflow.com/a/67819605/1387438 Note that if you are using Windows and MinGW, then locale support is poor. On other platforms the implicit use of UTF-8 should make this work quite easily. – Marek R Feb 26 '23 at 22:03
  • There's not enough space in a `uint8_t` to contain all the ASCII characters and all the emojis. You'll need a data structure with more space. – Thomas Matthews Feb 26 '23 at 22:08

4 Answers


emojis are UTF-8 characters

There is no such thing as a "UTF-8 character".

There are Unicode codepoints. These can be represented in the UTF-8 encoding, such that each codepoint maps to a sequence of one or more UTF-8 code units: char8_ts. But that means that most codepoints map to multiple char8_ts: AKA, a string. And emojis are not among the 128 codepoints that map to a single UTF-8 code unit.

Emoji in particular can be built out of multiple codepoints, so even using UTF-32, you cannot guarantee that any given emoji fits in a single char32_t.

It's best to treat these things as strings, not characters, at all times. Forget that "characters" even exist.
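
A short sketch of both points, assuming a C++20 compiler and a UTF-8-encoded source file:

#include <iostream>
#include <string_view>

int main() {
  // U+1F60B encodes to four UTF-8 code units, so it is a string, not a single char8_t.
  std::u8string_view smile = u8"😋";
  std::cout << smile.size() << '\n';  // prints 4

  // Some emoji span several codepoints: the "family" emoji below is five
  // codepoints (man, ZWJ, woman, ZWJ, girl), so it does not fit in one char32_t.
  std::u32string_view family = U"👨‍👩‍👧";
  std::cout << family.size() << '\n'; // prints 5

  return 0;
}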

– Nicol Bolas

Code

#include <iostream>

#ifdef _WIN32 
#include <Windows.h>
#define SET_CONSOLE_UTF8 SetConsoleCP(CP_UTF8); SetConsoleOutputCP(CP_UTF8); // Set console input and output code pages to UTF-8. Visual C++ on Windows.
#endif // _WIN32 


#if defined(__cpp_char8_t) || defined(__cpp_lib_char8_t)

//Operator <<
std::ostream& operator<<(std::ostream& os, const std::u8string& str)
{
    os << reinterpret_cast<const char*>(str.data());
    return os;
}

//Convert u8string to string.
std::string ToString(const std::u8string& s) {
    return std::string(s.begin(), s.end());
}

std::u8string Tou8String(const std::string& s) {
    return std::u8string(s.begin(), s.end());
}

//const char8_t* literal to string. Operator ""_s
static inline std::string operator"" _s(const char8_t* value, size_t size) {
    return std::string(reinterpret_cast<const char*>(value), size);
}

#endif


using namespace std::string_literals;// operator ""s

int main() {
#ifdef _WIN32
    SET_CONSOLE_UTF8
#endif

    std::u8string u8String = u8"😋"s; // u8string literal.
    std::string str = u8"😋"_s; // Operator ""_s: convert a UTF-8 literal (const char8_t*) to std::string.

    std::cout << "string              " << str << std::endl; //Using operator << for std::string
    std::cout << "u8string -> string  " << ToString(u8String) << std::endl; //Using function ToString(u8string) -> string
    std::cout << "u8string            " << u8String << std::endl; //Using operator << for std::u8string.
    std::cout << "string -> u8string  " << Tou8String(str) << std::endl; //Using function Tou8String(string) -> u8string

    std::cin.get();
    return 0;
}

Output on Windows Terminal and on https://godbolt.org/ (Clang and GCC):

string              😋
u8string -> string  😋
u8string            😋
string -> u8string  😋

(Screenshots of the same output from Visual C++, Clang on godbolt, and GCC were attached here.)

– Joma (edited by Remy Lebeau)
  • https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question Posting screenshots of godbolt pages is bad practice. You should share a link to godbolt, then anyone can play with it. Note that godbolt implicitly uses UTF-8 encoding; that doesn't have to be the default on real systems. – Marek R Feb 27 '23 at 13:08
  • I just updated the answer. Thanks for your recommendation. – Joma Feb 27 '23 at 15:02

When I try to compile the code below, g++ throws the following error:

The encoding expected by the narrow and wide standard streams is implementation-dependent, and may also depend on what the terminal you are ultimately printing to expects. You need to convert your character to the correct encoding, as char or wchar_t data, if you want to print to std::cout or std::wcout respectively.
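
For example, on a terminal that expects UTF-8 you can convert the codepoint by hand before printing. A minimal sketch; ToUtf8 is an illustrative helper, not a standard function:

#include <iostream>
#include <string>

// Encode one Unicode codepoint as UTF-8 (assumes the codepoint is valid).
std::string ToUtf8(char32_t cp) {
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

int main() {
  char32_t c = U'😋';
  std::cout << ToUtf8(c) << std::endl; // prints 😋 on a UTF-8 terminal
  return 0;
}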

Additionally, why can't I put the emoji into a char8_t or char16_t? For example, the following lines of code don't work:

The 😋 emoji is Unicode code point U+1F60B, which in both the UTF-8 and UTF-16 encodings requires multiple code units. But you are trying to form a character literal, which holds only one code unit.
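
For illustration (assuming a UTF-8-encoded source file): string literals can hold multiple code units, so these forms compile, while only the UTF-32 character literal fits in a single code unit:

#include <cstdint>
#include <iostream>
#include <string>

int main() {
  std::u8string s8 = u8"😋";  // 4 UTF-8 code units
  std::u16string s16 = u"😋"; // 2 UTF-16 code units (a surrogate pair)
  char32_t c32 = U'😋';       // OK: a single UTF-32 code unit
  std::cout << s8.size() << ' ' << s16.size() << ' '
            << static_cast<std::uint32_t>(c32) << '\n'; // prints "4 2 128523"
  return 0;
}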

From my understanding, emojis are UTF-8 characters [...]

That doesn't make sense. UTF-8 is an encoding for Unicode code points; it doesn't make sense to say that a character "is UTF-8". This suggests you may have some fundamental misunderstandings of how Unicode and character/string encodings in general work. I would suggest reading an introduction to the topic.

– user17732522

This works:

#include <iostream>

int main() {
  const char* c = "";

  std::cout << c << std::endl;

  return 0;
}

Explanation.

  1. 😋 is a multibyte sequence and does not fit in a single char. Thus const char* should be used.
  2. The default source file encoding is UTF-8, thus Unicode characters can be used only in UTF-8 encoding. As a char32_t literal it should be written as U'\x1F60B'.
  3. operator<<(std::basic_ostream) is deleted for char8_t, char16_t and char32_t since C++20, as the sketch below shows.
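
A minimal sketch of points 2 and 3; it prints the numeric codepoint value rather than the glyph:

#include <cstdint>
#include <iostream>

int main() {
  char32_t c = U'\x1F60B';  // the same codepoint written as an escape
  // std::cout << c;        // error: operator<< is deleted for char32_t since C++20
  std::cout << static_cast<std::uint32_t>(c) << std::endl; // prints 128523

  return 0;
}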
– 273K
  • On Windows it will not work out of the box, since the system usually uses country-specific one-byte encodings, which have no full Unicode support. – Marek R Feb 26 '23 at 22:06
  • "*The default source file encoding is UTF-8*" For which compiler? – Nicol Bolas Feb 26 '23 at 22:07
  • @NicolBolas On Linux, gcc and clang use UTF-8 implicitly. – Marek R Feb 26 '23 at 22:08
  • @MarekR Just tried in VS and it offered to save a file in UTF-8 with a BOM. So, this is not an issue. – 273K Feb 26 '23 at 22:11
  • auto also supports this ;) https://godbolt.org/z/97b1exsc7 – Eriss69 Feb 26 '23 at 22:57
  • @273K It depends on your machine settings. If you configured your machine to use code page 65001 (UTF-8), this will work out of the box. I'm currently using code page 437 (or 1252) and the outcome on my console is `≡ƒÿï`. I have to add `std::locale::global(std::locale(".utf-8")); std::cout.imbue(std::locale(""));` to make it work and change the console code page to something that supports this character (`chcp 65001`). – Marek R Feb 27 '23 at 09:42