
I've faced an issue and couldn't find an answer on the internet. Even though I found many similar questions, none of the answers worked for me. I'm using Visual Studio 2015 on Windows 10.

So part of my code is:

wstring books[50];
wstring authors[50];
wstring genres[50];
wstring takenBy[50];
wstring additional;
bool taken[50];
_setmode(_fileno(stdout), _O_U8TEXT);
wifstream fd("bookList.txt");
i = 0;
while (!fd.eof())
{
    getline(fd, books[i]);
    getline(fd, authors[i]);
    getline(fd, genres[i]);
    getline(fd, takenBy[i]);
    fd >> taken[i];
    getline(fd, additional);
    i++;
}

I need to read a UTF-8 encoded text file in C++. However, when I read the file, the wide strings come out altered, and when I print them, the output is completely different from the input.

Input:

ąčę

Output:

ÄÄÄ


How do I avoid it and read the text correctly?

informatik01
Kęstutis
  • So that we are not doomed to simply repeat all the answers that "didn't work for you", perhaps you can expand on why existing solutions are not applicable? – Lightness Races in Orbit Jul 02 '17 at 16:08
  • @LightnessRacesinOrbit Well, as I said, I need to read and print UTF-8 characters from a file. The answers and advice I tried from other places either didn't work at all or didn't do what I needed. For example, as I have already written, the strings lose their UTF-8 characters, and I want to know how to avoid that loss. – Kęstutis Jul 02 '17 at 16:12
  • Please update your question instead of posting a comment. And can you state what platform you are running on? Unicode still has some platform dependencies. – Richard Critten Jul 02 '17 at 16:14
  • If you are using the Windows console, see https://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how/388500#388500 – Michael Jul 02 '17 at 16:17
  • It's OT, but please note that `while (!fd.eof())` could lead to [unexpected results](https://stackoverflow.com/a/26557243/4944425). – Bob__ Jul 02 '17 at 16:27
  • @Bob__ Thank you for the advice! – Kęstutis Jul 02 '17 at 16:29

2 Answers


UTF-8 text is (probably) not what you want to keep in wide strings. Read about UTF-8 Everywhere. UTF-8 uses 8-bit bytes (sometimes several of them) to encode each Unicode character. So in C++ a Unicode character is parsed from a sequence of 1 to 4 bytes (i.e. char-s); the original design allowed sequences of up to 6 bytes, but RFC 3629 limits them to 4.

You need a UTF-8 parser, and the C11 and C++11 standards offer very little here (C++11 has std::wstring_convert and std::codecvt_utf8, but both were deprecated in C++17). So in practice you need an external library. Look into libunistring (a simple C library for UTF-8 parsing) or something else (Qt, POCO, Glib, ICU, ...). You could parse and convert UTF-8 into wide UTF-32 (using u32string-s and char32_t) and back, or, better, work internally in UTF-8 (using std::string and char).

Hence you'll parse and print sequences of char-s (in UTF-8 encoding), and your program will use plain std::string-s and plain char-s (not std::wstring or wchar_t) while still processing UTF-8 sequences.

Basile Starynkevitch

This is easy with Boost.Spirit:

#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>

using namespace boost::spirit;

int main()
{
    std::string in("ąčę");
    std::string out;
    qi::parse(in.begin(), in.end(), +unicode::char_, out);
    std::cout << out << std::endl;
}

The following example reads a sequence of tuples (book, authors, takenBy):

#define BOOST_SPIRIT_UNICODE
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted/std_tuple.hpp>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>

using namespace boost::spirit;

int main()
{
    std::string in("Book_1\nAuthors_1\nTakenBy_1\n"
                   "Book ąčę\nAuthors_2\nTakenBy_2\n");
    std::vector<
        std::tuple<
            std::string, /* book */
            std::string, /* authors */
            std::string  /* takenBy */
        > 
    > out;
    auto ok = qi::parse(in.begin(), in.end(),
                        *(
                               +(unicode::char_ - qi::eol) >> qi::eol /* book */
                            >> +(unicode::char_ - qi::eol) >> qi::eol /* authors */
                            >> +(unicode::char_ - qi::eol) >> qi::eol /* takenBy */
                        ),
                        out);
    if(ok)
    {
        for(auto& entry : out)
        {
            std::string book, authors, takenBy;
            std::tie(book, authors, takenBy) = entry;
            std::cout << "book: "    << book    << std::endl
                      << "authors: " << authors << std::endl
                      << "takenBy: " << takenBy << std::endl;
        }
    }
}

This is only a demo: it uses std::tuple and an unnamed parser, passed as the third parameter of qi::parse. You can use a struct instead of the tuple to represent books, authors, genres, etc. The unnamed parser may be replaced by a grammar, and you can read the content of the file into a string to pass to qi::parse.

Cosme