
Please note that this is not the same question as "How to open an std::fstream (ofstream or ifstream) with a unicode filename?". That question was about a Unicode filename; this one is about Unicode file contents.

I need to read a UTF-8 encoded file (containing Spanish characters) with an ifstream. Under Linux this is no problem, but under Windows it is.

bool OpenSpanishFile(string filename)
{
    ifstream spanishFile;
    #ifdef WINDOWS
    spanishFile.open(filename.c_str(),ios::binary);
    #endif

    if (!spanishFile.is_open()) return false;
    spanishFile.clear();
    spanishFile.seekg(ios::beg);
    while (spanishFile.tellg()!=-1)
    {
        string line="";
        getline(spanishFile,line);
        //do stuff
        cout << line << endl;
    }
    return true;

}

I compile it under Linux with:

i586-mingw32msvc-g++ -s -fno-rtti test.cpp -o test.exe

And then run it in wineconsole test.exe.

The output contains all kinds of weird characters, so the file contents seem to be interpreted in some encoding other than UTF-8.

I have searched the internet a lot about how to open a unicode file this way, but I couldn't get it to work.

Does anyone know a solution that does work with mingw? Thank you so much in advance.

L.A. Rabida
  • strange code. if the `WINDOWS` part is only included in Windows, how then does the file get opened in Linux? i.e., is this the real code. – Cheers and hth. - Alf May 28 '14 at 20:38
  • I think this is best answered by the answer to [this question on StackOverflow about UTF8 encoding wifstream](http://stackoverflow.com/questions/1274910/does-wifstream-support-different-encodings). – MrGregs May 28 '14 at 20:39
  • @MrGregs: there's no need at all to go through such complexities just to read the UTF-8 data. the problem is a mis-interpretation of the result of displaying that data via `cout`, that's all. – Cheers and hth. - Alf May 28 '14 at 20:57

1 Answer


Most likely (it's unclear whether the presented code is the real code) the reason that you see garbage is that std::cout in Windows defaults to presenting its result in a non-UTF-8 console window.
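One way to check this independently of the console is to print the bytes that were read as hexadecimal. The following is a minimal sketch; `hex_dump` is a hypothetical helper name, not something from the question. In UTF-8 the letter ñ is encoded as the two bytes C3 B1, so if those bytes show up in the dump the file was read correctly and only the display is at fault.

```cpp
#include <string>       // std::string

// Hypothetical helper (not part of the original code): renders each byte of
// `bytes` as two uppercase hex digits, with single spaces between bytes.
std::string hex_dump( std::string const& bytes )
{
    char const digits[] = "0123456789ABCDEF";
    std::string result;
    for( unsigned char const byte : bytes )
    {
        if( !result.empty() ) { result += ' '; }
        result += digits[byte >> 4];        // High nibble.
        result += digits[byte & 0x0F];      // Low nibble.
    }
    return result;
}
```

Calling `hex_dump( line )` on each line read from the file and sending that to `cout` is safe on any console, because the dump is plain ASCII.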

To properly check whether you're reading the UTF-8 file correctly, simply collect all the input in a string, convert it from UTF-8 to UTF-16 wstring, and display that using MessageBoxW (or wide direct console output).

The following UTF-8 → UTF-16 conversion function works nicely with Visual C++ 12.0:

#include <codecvt>          // std::codecvt_utf8_utf16
#include <locale>           // std::wstring_convert
#include <string>           // std::wstring

auto wstring_from_utf8( char const* const utf8_string )
    -> std::wstring
{
    std::wstring_convert< std::codecvt_utf8_utf16< wchar_t > > converter;
    return converter.from_bytes( utf8_string );
}

Unfortunately, even though it only uses standard C++11 functionality, it fails to compile with MinGW g++ 4.8.2, but hopefully you have Visual C++ (after all it's free).


As an alternative you can code up a conversion function using the Windows API MultiByteToWideChar.

For example, the following code works nicely with g++ 4.8.2 with -D USE_WINAPI:

#undef UNICODE
#define UNICODE
#include <windows.h>
#include <shellapi.h>       // ShellAbout

#ifndef USE_WINAPI
#   include <codecvt>          // std::codecvt_utf8_utf16
#   include <locale>           // std::wstring_convert
#endif
#include <fstream>          // std::ifstream
#include <iostream>         // std::cerr, std::endl
#include <stdexcept>        // std::runtime_error, std::exception
#include <stdlib.h>         // EXIT_FAILURE, EXIT_SUCCESS
#include <string.h>         // strlen
#include <string>           // std::string, std::wstring

namespace my {
    using std::ifstream;
    using std::ios;
    using std::runtime_error;
    using std::string;
    using std::wstring;

    #ifndef USE_WINAPI
        using std::codecvt_utf8_utf16;
        using std::wstring_convert;
    #endif

    auto hopefully( bool const c ) -> bool { return c; }
    auto fail( string const& s ) -> bool { throw runtime_error( s ); }

    #ifdef USE_WINAPI
        auto wstring_from_utf8( char const* const utf8_string )
            -> wstring
        {
            if( *utf8_string == '\0' )
            {
                return L"";
            }
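            // UTF-8 never needs more UTF-16 units than it has bytes.
            int const n_bytes = static_cast<int>( strlen( utf8_string ) );
            wstring result( n_bytes, L'#' );
            int const n_chars = MultiByteToWideChar(
                CP_UTF8,
                0,      // Flags, only alternative is MB_ERR_INVALID_CHARS
                utf8_string,
                n_bytes,    // Explicit length, so no terminating null is
                            // converted (with -1 it would need buffer space).
                &result[0],
                static_cast<int>( result.size() )
                );
            hopefully( n_chars > 0 )
                || fail( "MultiByteToWideChar" );
            result.resize( n_chars );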
            return result;
        }
    #else
        auto wstring_from_utf8( char const* const utf8_string )
            -> wstring
        {
            wstring_convert< codecvt_utf8_utf16< wchar_t > > converter;
            return converter.from_bytes( utf8_string );
        }
    #endif

    auto text_of_file( string const& filename )
        -> string
    {
        ifstream f( filename, ios::in | ios::binary );
        hopefully( !f.fail() )
            || fail( "file open" );
        string result;
        string s;
        while( getline( f, s ) )
        {
            result += s + '\n';
        }
        return result;
    }

    void cpp_main()
    {
        string const    utf8_text   = text_of_file( "spanish.txt" );
        wstring const   wide_text   = wstring_from_utf8( utf8_text.c_str() );
        //ShellAbout( 0, L"Spanish text", wide_text.c_str(), LoadIcon( 0, IDI_INFORMATION ) );
        MessageBox(
            0,
            wide_text.c_str(),
            L"Spanish text",
            MB_ICONINFORMATION | MB_SETFOREGROUND
            );
    }
}  // namespace my

auto main()
    -> int
{
    using namespace std;
    try
    {
        my::cpp_main();
        return EXIT_SUCCESS;
    }
    catch( exception const& x )
    {
        cerr << "!" << x.what() << endl;
    }
    return EXIT_FAILURE;
}
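If neither `<codecvt>` nor the Windows API is available, the UTF-8 → UTF-16 conversion can also be written by hand. The following is only a sketch under stated assumptions: `utf16_from_utf8` is a hypothetical name, it returns a `std::u16string` so that it behaves the same on every platform (on Windows, where `wchar_t` is 16 bits, the code units can be copied into a `wstring`), and for brevity it does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>      // std::size_t
#include <stdexcept>    // std::runtime_error
#include <string>       // std::string, std::u16string

// Hypothetical helper, not from the answer above: decodes UTF-8 bytes into
// UTF-16 code units. Throws on truncated or malformed sequences.
std::u16string utf16_from_utf8( std::string const& utf8 )
{
    std::u16string result;
    std::size_t i = 0;
    while( i < utf8.size() )
    {
        unsigned char const lead = static_cast<unsigned char>( utf8[i] );
        std::size_t n_trailing = 0;     // Number of continuation bytes.
        char32_t code_point = 0;
        if( lead < 0x80 )                { n_trailing = 0; code_point = lead; }
        else if( (lead & 0xE0) == 0xC0 ) { n_trailing = 1; code_point = lead & 0x1F; }
        else if( (lead & 0xF0) == 0xE0 ) { n_trailing = 2; code_point = lead & 0x0F; }
        else if( (lead & 0xF8) == 0xF0 ) { n_trailing = 3; code_point = lead & 0x07; }
        else { throw std::runtime_error( "invalid UTF-8 lead byte" ); }

        if( utf8.size() - i - 1 < n_trailing )
        {
            throw std::runtime_error( "truncated UTF-8 sequence" );
        }
        for( std::size_t k = 1; k <= n_trailing; ++k )
        {
            unsigned char const byte = static_cast<unsigned char>( utf8[i + k] );
            if( (byte & 0xC0) != 0x80 )
            {
                throw std::runtime_error( "invalid UTF-8 continuation byte" );
            }
            code_point = (code_point << 6) | (byte & 0x3F);
        }
        i += 1 + n_trailing;

        if( code_point < 0x10000 )
        {
            result += static_cast<char16_t>( code_point );
        }
        else
        {
            if( code_point > 0x10FFFF )
            {
                throw std::runtime_error( "code point out of range" );
            }
            // Outside the Basic Multilingual Plane: emit a surrogate pair.
            code_point -= 0x10000;
            result += static_cast<char16_t>( 0xD800 + (code_point >> 10) );
            result += static_cast<char16_t>( 0xDC00 + (code_point & 0x3FF) );
        }
    }
    return result;
}
```

Spanish text only needs the one- and two-byte branches; the remaining branches are there so that other input is converted rather than silently mangled.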


Cheers and hth. - Alf
  • I use Ubuntu so that's why I don't have easy access to Visual C++. I did try to use a graphical messagebox instead of cout, but with the same result. – L.A. Rabida May 28 '14 at 22:51
  • @L.A.Rabida: if you got the same result then you forgot to translate to UTF-16. if you got apparently the same result (gobbledegook) then maybe you did translate but called the ANSI version of `MessageBox` and cast the argument, or something like that. impossible to say. – Cheers and hth. - Alf May 28 '14 at 22:54
  • I did have to change the code a little bit because I got some compiler errors, but after that it worked! Fantastic! Thank you so much! – L.A. Rabida May 29 '14 at 11:50
  • Thinking about it, it's probably unnecessary to read the file as binary, and reading it as binary *may* cause undesired control characters (carriage return) in result. Now that it's apparently working you can try it with text mode reading. When/if text mode works it's absolutely preferable, to avoid having to deal explicitly with differences of Unix and Windows end-of-line conventions. – Cheers and hth. - Alf May 29 '14 at 12:50
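
If the file does have to stay in binary mode, the carriage returns mentioned in the last comment can be dropped explicitly. A minimal sketch, assuming Windows-style `\r\n` line endings in the input; `getline_without_cr` and `text_with_unix_newlines` are hypothetical names used only for illustration:

```cpp
#include <istream>      // std::istream
#include <sstream>      // std::istringstream
#include <string>       // std::string, std::getline

// Hypothetical helper: reads one line and removes a trailing carriage
// return, which is what a "\r\n" line ending leaves behind in binary mode.
std::istream& getline_without_cr( std::istream& stream, std::string& line )
{
    if( std::getline( stream, line ) && !line.empty() && line.back() == '\r' )
    {
        line.pop_back();
    }
    return stream;
}

// Hypothetical helper: rebuilds `raw` with every line terminated by '\n'.
std::string text_with_unix_newlines( std::string const& raw )
{
    std::istringstream stream( raw );
    std::string result;
    std::string line;
    while( getline_without_cr( stream, line ) )
    {
        result += line + '\n';
    }
    return result;
}
```

The same `getline_without_cr` call can replace `getline` in a loop over an `ifstream` opened with `ios::binary`.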