
I am trying to read UTF-8 content into a char*. My file does not have a BOM, so the code is straightforward (the file contains Unicode punctuation).

char* fileData = "\u2010\u2020";

I cannot see how a single unsigned char (0 to 255) can contain a character with a value in the range 0 to 65535, so I must be missing something.

...
std::ifstream fs8("../test_utf8.txt");
if (fs8.is_open()) 
{
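  unsigned line_count = 0;  // incremented before printing, so the first line is numbered 1
  std::string line;
  while (std::getline(fs8, line))
  {
    // narrow '\n': a wide L'\n' would be printed as a number on a narrow stream
    std::cout << ++line_count << '\t' << line << '\n';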
  }
}
...

So how can I read a UTF-8 file into a char* (or even a std::string)?

Simon Goodman

1 Answer


Well, you ARE reading the file correctly into std::string, and std::string can hold UTF-8. The problem is probably that your console* cannot display non-ASCII characters.

Basically, when a character's code point is above 127 (that is, outside ASCII), you represent it with several chars. How, and how many? That is what an encoding defines. UTF-8 uses two to four bytes for such characters, while UTF-32, for example, stores every character, ASCII or not, as 4 bytes, hence the "32" (each byte is 8 bits, 4*8 = 32).
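
To make the "how many chars" part concrete, here is a minimal sketch of how one code point is spread over several bytes in UTF-8. `encode_utf8` is a hypothetical helper written for this illustration, not a standard library function, and it assumes the input is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of the standard library):
// encode one Unicode code point as 1 to 4 UTF-8 bytes.
std::string encode_utf8(char32_t cp)
{
    // Assumption: caller passes a valid scalar value (no surrogates, <= U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    std::string out;
    if (cp < 0x80) {                        // ASCII: one byte, value unchanged
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, `encode_utf8(0x2010)` yields the three bytes E2 80 90, so a single char never has to hold a value above 255.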

Without additional information about which OS you are using, we can't advise on how your program can display the file's lines.

*or more exactly, the standard output which will probably be implemented as console text.

David Haim
  • I am not trying to output the text to the console; rather, I am passing it to pcre. If I do it directly with 'char* fileData = "\u2010\u2020";' then I can use '\p{P}', but if I read the exact same data from the file, the value is garbage and the regex no longer works. – Simon Goodman Sep 15 '15 at 20:25
  • I am using Visual studio 2015, but I am using straight C++11, nothing 'window' specific. – Simon Goodman Sep 15 '15 at 20:26
  • The value is not garbage, it's UTF-8 encoded. You can google more about UTF-8, or read more here, for example: http://stackoverflow.com/questions/11254232/do-c11-regular-expressions-work-with-utf-8-strings – David Haim Sep 15 '15 at 20:28
  • @SimonGoodman: You cannot assign a `"\u2010\u2020"` string literal to a `char*` like you have shown. You have to use `wchar_t*` instead: `wchar_t *filedata = L"\u2010\u2020";`. In any case, if your PCRE library supports UTF-8 strings, you would simply read the UTF-8 data (UTF-8 is 8bit - hence its name - so can fit in `char` elements) from your file into `std::string`, just as you have shown. `getline()` will read the *raw* file data into `std::string`, and then you can use `std::string::c_str()` to pass the data to the PCRE library if it is expecting `char*` input. – Remy Lebeau Sep 15 '15 at 20:28
  • @RemyLebeau theoretically speaking if the string literal is valid utf8 he can assign it to `const char*` with the `u8` prefix. – David Haim Sep 15 '15 at 20:31
  • @SimonGoodman: FWIW, `"\u2010\u2020"` is `"‐†"`, which is encoded in UTF-8 using `char` values as `"\xE2\x80\x90\xE2\x80\xA0"`. – Remy Lebeau Sep 15 '15 at 20:31
  • But if I 'read' "\xE2\x80\x90\xE2\x80\xA0" from the file, then it is no surprise that the pcre does not find a match. Only a "\x2010" would be a match. So I need to read into a wchar_t* and pass that to pcre, not a char* – Simon Goodman Sep 15 '15 at 21:00
  • PCRE can operate directly on the UTF-8 encoded string if you use the right flag. It is smart enough to figure out that those three bytes `"\xE2\x80\x90"` are one character. – roeland Sep 16 '15 at 06:31
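
Pulling the comments together, the following is a rough sketch of reading the file as raw bytes and checking that those bytes really are the expected characters. `read_file` and `decode_utf8` are hypothetical helper names introduced here for illustration; the decoder only checks the byte structure and does not reject overlong forms or surrogates.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: read a whole file as raw bytes.
// UTF-8 data needs no conversion to be stored in std::string.
std::string read_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Hypothetical helper: decode UTF-8 bytes into code points.
// Sketch only: validates lead/continuation bytes, not overlong forms.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len;
        char32_t cp;
        if (lead < 0x80)              { len = 1; cp = lead; }         // 0xxxxxxx
        else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1F; }  // 110xxxxx
        else if ((lead >> 4) == 0x0E) { len = 3; cp = lead & 0x0F; }  // 1110xxxx
        else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }  // 11110xxx
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c >> 6) != 0x02)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out += cp;
        i += len;
    }
    return out;
}
```

Decoding the six bytes `"\xE2\x80\x90\xE2\x80\xA0"` this way gives the two code points U+2010 and U+2020, which is what a UTF-8-aware regex engine does internally. For PCRE itself you would not decode at all: pass `text.c_str()` from `read_file` and enable UTF-8 mode when compiling the pattern. As far as I know the option is `PCRE_UTF8` in PCRE 1.x and `PCRE2_UTF` in PCRE2, and `\p{P}` additionally requires a build with Unicode property support, so check the documentation for your version.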