I'm having a hard time parsing an XML file.

The file was saved with UTF-8 encoding.

Plain ASCII characters are read correctly, but Korean characters are not.

So I made a simple program to read a UTF-8 text file and print the content.

Text file (test.txt)

ABC가나다

Test Program

#include <fstream>
#include <iostream>
#include <string>
#include <iterator>
#include <streambuf>

const char* hex(char c) {
    const char REF[] = "0123456789ABCDEF";
    static char output[3] = "XX";
    output[0] = REF[0x0f & c>>4];
    output[1] = REF[0x0f & c];
    return output;
}

int main() {
    std::cout << "File(ifstream) : ";
    std::ifstream file("test.txt");
    std::string buffer((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
    for (auto c : buffer) {
        std::cout << hex(c)<< " ";
    }
    std::cout << std::endl;
    std::cout << buffer << std::endl;

    //String literal
    std::string str = "ABC가나다";
    std::cout << "String literal : ";
    for (auto c : str) {
        std::cout << hex(c) << " ";
    }
    std::cout << std::endl;
    std::cout << str << std::endl;

    return 0;
}

Output

File(ifstream) : 41 42 43 EA B0 80 EB 82 98 EB 8B A4
ABC媛?섎떎
String literal : 41 42 43 B0 A1 B3 AA B4 D9
ABC가나다

The output shows that the characters are encoded differently in the string literal and in the file.

As far as I know, char strings in C++ are encoded in UTF-8, which is why we can print them with printf or cout. So the bytes should have been the same, but they are actually different.

Is there any way to read a UTF-8 text file using std::ifstream?


I succeeded in parsing the XML file using std::wifstream, following this article.

But most of the libraries I'm using support only const char* strings, so I'm looking for another way that uses std::ifstream.

I've also read this article, which says not to use wchar_t and that treating char strings as multi-byte characters is sufficient.

JaeJun LEE
  • You should [`imbue()`](http://en.cppreference.com/w/cpp/io/basic_ios/imbue) a UTF-8 locale into the `std::ifstream` before reading the file data. You also need to `imbue()` a UTF-8 locale into `std::cout`, and/or set your terminal's charset to UTF-8. Your `ifstream` output is correct for UTF-8 (the UTF-8 encoded form of `ABC가나다` really is 12 bytes). Your string literal example does not produce the correct output, because it is subject to the charset that you saved your source code file as, as well as the charset of your terminal, neither of which are using UTF-8. – Remy Lebeau Apr 10 '17 at 20:49
  • This does not make sense. If a file is encoded in UTF-8, and you want to read it as 8-bit characters (a `std::string` of `char`) *as UTF-8*, you just have to read the characters with no conversion. What exactly are you trying to achieve? – Serge Ballesta Apr 10 '17 at 21:18
  • If you are on Windows you may have to open the files in *binary* mode to prevent certain character conversions. I've never had a problem reading `UTF-8` with file streams. – Galik May 15 '17 at 22:20
  • @RemyLebeau The MSVC runtime does not support Unicode locales, so the only way to get a UTF-8 locale object to imbue a stream with is Boost.Locale, which is way too much for such a simple task. –  Nov 22 '17 at 13:23

1 Answer

Encoding "ABC가나다" using UTF-8 should give you

"\x41\x42\x43\xEA\xB0\x80\xEB\x82\x98\xEB\x8B\xA4"

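so the content you read from the file is correct. The problem is with your source file encoding. A plain string literal containing non-ASCII characters is stored in the compiler's execution character set, which is not necessarily UTF-8; the bytes B0 A1 B3 AA B4 D9 in your output look like a Korean code page (EUC-KR/CP949) rather than UTF-8. Prefix the literal with u8 to get a UTF-8 literal: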

u8"ABC가나다"

At this point I assume you are using Windows, where the console does not use UTF-8 by default. You will have to change your terminal's character set to UTF-8:

chcp 65001

What is happening in your case is that you are reading UTF-8 text from a file into a string and then printing it to a non-Unicode terminal, which cannot display it as you expect. When you print your string literal, you are printing a non-Unicode byte sequence, but that sequence's encoding matches your terminal's encoding, so you see what you expected.

PS: I used https://mothereff.in/utf-8 to get the UTF-8 representation of your string in hex.