std::string is natively encoded in UTF-8 but char cannot hold UTF-8 characters?

Question

After reading std::wstring VS std::string, I was under the impression that on Linux I don't need to worry about using any of the language's wide-character facilities
(things like std::wifstream, std::wofstream, std::wstring, wchar_t, etc.).

This seems to work fine when I use only std::strings for the non-ASCII characters, but not when I use chars to handle them.

For example: I have a file with just a Unicode checkmark in it.
I can read it in, print it to the terminal, and write it to a file.

// ✓ reads in unicode to string
// ✓ outputs unicode to terminal
// ✓ outputs unicode back to the file
#include <iostream>
#include <string>
#include <fstream>

int main(){
  std::ifstream in("in.txt");
  std::ofstream out("out.txt");

  std::string checkmark;
  std::getline(in,checkmark); //size of string is actually 3 even though it just has 1 unicode character

  std::cout << checkmark << std::endl;
  out << checkmark;

}

However, the same program does not work if I use a char in place of the std::string:

// ✕ only partially reads in unicode to char
// ✕ does not output unicode to terminal
// ✕ does not output unicode back to the file
#include <iostream>
#include <string>
#include <fstream>

int main(){
  std::ifstream in("in.txt");
  std::ofstream out("out.txt");

  char checkmark;
  checkmark = in.get();

  std::cout << checkmark << std::endl;
  out << checkmark;

}

Nothing appears in the terminal (apart from a newline).
The output file contains â instead of the checkmark character.

Since a char is only one byte, I tried using a wchar_t instead, but it still does not work:

// ✕ only partially reads in unicode to wchar_t
// ✕ does not output unicode to terminal
// ✕ does not output unicode back to the file
#include <iostream>
#include <string>
#include <fstream>

int main(){
  std::wifstream in("in.txt");
  std::wofstream out("out.txt");

  wchar_t checkmark;
  checkmark = in.get();

  std::wcout << checkmark << std::endl;
  out << checkmark;

}

I've also read about setting the following locale, but it does not appear to make a difference.

setlocale(LC_ALL, "");

I wonder why you even think that reading a `char` can lead to the same behavior... – leemes, Aug 20 '14 at 01:43
@leemes How do I read one multi-byte character at a time then? Am I just expected to read it all into a string and deal with it? Should I be using wchar_t under Linux? wchar_t does not appear to hold the character either. – Trevor Hickey, Aug 20 '14 at 01:50
To work with UTF-8 encoded text you need to get rid of the idea that one byte = one visible character. The visible character may be anywhere from 1 to 4 bytes of data. A `std::string` can hold those bytes, but it knows nothing about the encoding. You'll be best served by using a library to handle that. – Retired Ninja, Aug 20 '14 at 02:10
Even more so with Unicode's combining characters. A base character can be followed by any number of combining "characters" that comprise a grapheme. (The order of multiple combining characters has no meaning.) Some graphemes (ä) can even be represented in one codepoint ("\u00E4") or multiple ("\u0061\u0308"). So, when reading text, it is not enough to read a complete codepoint; you have to read ahead up to the next non-combining character or end-of-file. – Tom Blodget, Aug 20 '14 at 02:51
It sounds like you don't realize that `checkmark.size() > 1` in the first example. – Bill Lynch, Aug 20 '14 at 03:11
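
To make the size-versus-character distinction from these comments concrete, here is a minimal sketch that counts UTF-8 code points in a std::string. The helper name count_code_points is hypothetical, and the code assumes the string already holds well-formed UTF-8; it does no validation and counts code points, not graphemes, so combining sequences still count as several.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points in `s`.
// Assumes `s` is well-formed UTF-8; nothing is validated.
// Every byte that is NOT a continuation byte (bit pattern 10xxxxxx)
// starts a new code point, so counting those bytes is enough.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the checkmark in the question (U+2713, encoded as the bytes E2 9C 93), size() on the string is 3 while count_code_points returns 1.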

John Zwinck · Accepted Answer · 2014-08-20T02:10:53.477

3

In the std::string case you read one line, which in your case contains a single multi-byte Unicode character (three bytes in UTF-8). In the char case you read a single byte, which is not even one complete character.

Edit: for UTF-8 you should read into an array of char. Or just std::string since that already works.

edited Aug 20 '14 at 02:10

answered Aug 20 '14 at 01:42


So I'm supposed to be using a wchar_t then? – Trevor Hickey Aug 20 '14 at 01:53

No, you just need to pay closer attention to what you are reading. Non-ASCII characters require more than 1 byte in UTF-8. `std::getline()` handles that (assuming the file is UTF-8 encoded), but `ifstream::get()` does not. – Remy Lebeau Aug 20 '14 at 02:04
