I have been struggling for some time with trying to extract an int from a UTF8 file:

#include <iostream>
#include <fstream>
#include <sstream>

using namespace std;

int main()
{
    ifstream file("UTF8.txt");
    if(file.is_open())
    {
        string line;
        getline(file, line);
        istringstream ss(line);
        int a;
        ss >> a;
        if(ss.fail())
        {
            cout << "Error parsing" << endl;
            ss.clear();
        }
        getline(file, line);
        cout << a << endl << line << endl;
        file.close();
    }
}

The file contains 2 lines: "42" and "è_é", and is saved in Notepad as UTF8. The above works when the file is saved as ANSI, but fails when it is saved as UTF-8. I've tried a number of things, the most promising one being to set the locale, but I would like the program to be independent of the locale of the computer (i.e. read Chinese characters even if the PC is a US one). Honestly, I'm out of ideas now. I'd like to avoid using CStrings from Qt if possible.

UPDATE

The following displays "0", "Error parsing" because of one weird character at the very beginning of the file. An empty line, discarded when read, just before the number makes it work but I can't change the file in the final program. Accents are not displayed properly in the console, but when I write the output to a file all is well and that's all I need. So it's only that issue with the beginning of the file!

#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <sstream>

int main()
{
    std::ifstream file("UTF8.srt");
    file.imbue(std::locale(file.getloc(),
        new std::codecvt_utf8<wchar_t,0x10ffff,std::consume_header>));
    if (file.is_open()) {
        std::string line;
        std::getline(file,line);
        std::istringstream ss{line};
        int a;
        ss >> a;
        if (ss.fail()) {
            std::cout << "Error parsing" << std::endl;
            ss.clear();
        }
        getline(file,line);
        std::cout << a << std::endl << line << std::endl;
        file.close();
    }
}

SOLUTION

The following works, with the input file content as follows:

5
bla bla é_è

6
truc è_é

Code:

#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Do not get used to it:
// using namespace std;

inline const char* skip_utf8_bom(const char* s, std::size_t size)
{
    if(3 <= size && s[0] == char(0xEF) && s[1] == char(0xBB) && s[2] == char(0xBF))
        s += 3;
    return s;
}

int main()
{
    std::ifstream file("UTF8.txt");
    std::ofstream fileO("UTF8_copy.txt");
    if(!file || !fileO) {
        std::cout << "Error opening files" << std::endl;
    }
    else {
        std::string line;

        //Parse the first number
        std::getline(file, line);
        {
            const char* linePtr = skip_utf8_bom(line.c_str(), line.size());
            std::istringstream input(linePtr);
            int a = -1;
            input >> a;
            if( ! input) {
                std::cout << "Error parsing" << std::endl;
            }
            std::cout << "Number 1: " << a << std::endl;
            fileO << a << std::endl;
        }

        //Copy the following line as is
        std::getline(file, line);
        fileO << line << std::endl;

        //Discard empty line, copy it in the output file
        std::getline(file, line);
        fileO << std::endl;

        //Parse the second number
        std::getline(file, line);
        {
            const char* linePtr = skip_utf8_bom(line.c_str(), line.size());
            std::istringstream input(linePtr);
            int a = -1;
            input >> a;
            if( ! input) {
                std::cout << "Error parsing" << std::endl;
            }
            std::cout << "Number 2: " << a << std::endl;
            fileO << a << std::endl;
        }

        //Copy the following line as is
        std::getline(file, line);
        fileO << line << std::endl;

        file.close();
        fileO.close();
    }

    return 0;
}
Mister Mystère
  • Open the file in a hex editor - there might be a BOM for UTF8 –  Mar 17 '16 at 12:13
  • What do you mean? What's a BOM? – Mister Mystère Mar 17 '16 at 12:14
  • Byte Order Mark: https://en.wikipedia.org/wiki/Byte_order_mark –  Mar 17 '16 at 12:15
  • Check the Notepad++ text editor (it can easily check and convert between encodings, and is a great editor overall). A UTF-8 text file on Windows should contain a BOM, as @MisterMystère said. – jonezq Mar 17 '16 at 12:23
  • To resolve the parsing error in the updated code; change `ifstream` to `wifstream`, `string` to `wstring` and `istringstream` to `wistringstream`. – wally Mar 17 '16 at 13:34
  • Already done, and in fact I was wrong, accents and such are not correctly displayed. Back to square one... Have you tested your code? – Mister Mystère Mar 17 '16 at 13:36
  • You need only one `skip_utf8_bom` at the beginning of the file –  Mar 17 '16 at 14:23

2 Answers

Read the file with std::codecvt_mode

Example from the link above:

#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

int main()
{
    // UTF-8 data with BOM
    std::ofstream("text.txt") << u8"\ufeffz\u6c34\U0001d10b";
    // read the UTF8 file, skipping the BOM
    std::wifstream fin("text.txt");
    fin.imbue(std::locale(fin.getloc(),
                          new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>));
    for (wchar_t c; fin.get(c); )
        std::cout << std::hex << std::showbase << c << '\n';
}

Note the std::consume_header setting.

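Adapted to your question it might be the following. Note that std::codecvt_utf8 is only specified for wchar_t, char16_t and char32_t; this char variant builds in Visual Studio 2015 but fails to link with g++ and clang (see the comments), so the wchar_t version further down is the safer choice: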

#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <sstream>

int main()
{
    std::ifstream file("UTF8.txt");
    file.imbue(std::locale(file.getloc(),
        new std::codecvt_utf8<char,0x10ffff,std::consume_header>));
    if (file.is_open()) {
        std::string line;
        std::getline(file,line);
        std::istringstream ss{line};
        int a;
        ss >> a;
        if (ss.fail()) {
            std::cout << "Error parsing" << std::endl;
            ss.clear();
        }
        getline(file,line);
        std::cout << a << std::endl << line << std::endl;
        file.close();
    }
}

Or with wchar_t:

#include <fstream>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <sstream>

int main()
{
    std::wifstream file("UTF8.txt");
    file.imbue(std::locale(file.getloc(),
        new std::codecvt_utf8<wchar_t,0x10ffff,std::consume_header>));
    if (file.is_open()) {
        std::wstring line;
        std::getline(file,line);
        std::wistringstream ss{line};
        int a;
        ss >> a;
        if (ss.fail()) {
            std::wcout << L"Error parsing" << std::endl;
            ss.clear();
        }
        std::getline(file,line);
        std::wcout << a << std::endl << line << std::endl;
        file.close();
    }
}
wally
  • Thanks - but I can't compile it, it says codecvt: no such file or directory. Using CodeBlocks & MingW on Windows 64 bits. – Mister Mystère Mar 17 '16 at 12:24
  • @MisterMystère Then your compiler/MinGW is misconfigured. – Konrad Rudolph Mar 17 '16 at 12:27
  • Interesting, didn’t know about this option. I find it puzzling that you’d open the file as a `wchar_t` stream though. Wouldn’t the same work with `char` (keeping in mind that the recommendation is to use [UTF-8 everywhere](http://utf8everywhere.org/))? – Konrad Rudolph Mar 17 '16 at 12:28
  • @KonradRudolph: everything else works, it's only codecvt. And I've added the C++11 flag. I've been on that for 2 days, really starting to become desperate. – Mister Mystère Mar 17 '16 at 12:30
  • @MisterMystère Good point, looks like MinGW is outdated; the header was added very recently to libstdc++: http://stackoverflow.com/questions/15615136/is-codecvt-not-a-std-header — You might have to switch to MinGW-64 or Cygwin, as MinGW seems to be woefully understaffed. As far as I can see, they won’t ship a recent GCC for the foreseeable future. – Konrad Rudolph Mar 17 '16 at 12:33
  • Good catch. I've downloaded the latest MinGW and set up Codeblocks accordingly. I tried to adapt the code to extract the int I require, and it does not work. I have updated my post; what do you think? – Mister Mystère Mar 17 '16 at 12:46
  • Thanks for the edit. Your code fails to compile because "undefined reference to std::codecvt_utf8 [...]"... WHAT IS GOING ON. Sigh. It's certainly not your code though, something's fishy. – Mister Mystère Mar 17 '16 at 12:59
  • OK I changed "char" to wchar_t and it compiles now - also changed the name for "UTF8.srt". "Error parsing" is displayed, although it successfully reads the next line. That's encouraging? – Mister Mystère Mar 17 '16 at 13:03
  • I see what you mean about the compile error. In visual studio 2015 it works fine, but on clang and g++ [it doesn't](http://coliru.stacked-crooked.com/a/723ff2b29932e854). – wally Mar 17 '16 at 13:09
  • As soon as I changed char to wchar_t it compiled just fine. Now, the second line is OK but the number is not parsed correctly... – Mister Mystère Mar 17 '16 at 13:14
  • [This works](http://coliru.stacked-crooked.com/a/e7b7e7627436c0a4), but I think the website doesn't show unicode text. – wally Mar 17 '16 at 13:22
  • I think it works even without the locale (I was fooled by the console terminal which apparently can't display unicode characters, but writing to files works), it looks like the only issue was that leading BOM. I accepted the other answer, but +1'd you for all the help - after all it is thanks to the investigation we've led together that I was able to see it had something to do with a "weird character at the beginning of the file". – Mister Mystère Mar 17 '16 at 14:47

Just skip the leading BOM (Byte Order Mark):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Do not get used to it:
// using namespace std;

inline const char* skip_utf8_bom(const char* s, std::size_t size)
{
    if(3 <= size && s[0] == char(0xEF) && s[1] == char(0xBB) && s[2] == char(0xBF))
        s += 3;
    return s;
}


int main()
{
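    // UTF-8 bytes written out explicitly: BOM, "42", then è_é.
    // A plain literal also builds as C++20, where u8 literals are char8_t.
    std::istringstream file("\xEF\xBB\xBF" "42\n" "\xC3\xA8_\xC3\xA9\n");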
    std::string line;
    getline(file, line);
    const char* linePtr = skip_utf8_bom(line.c_str(), line.size());
    std::istringstream input(linePtr);
    int a = -1;
    input >> a;
    if( ! input) {
        std::cout << "Error parsing" << std::endl;
    }
    getline(file, line);
    std::cout << a << std::endl << line << std::endl;
}