
I am trying to read and process multiple files that use different encodings, and I am supposed to use only the STL for this. Suppose that we have ISO-8859-15 and UTF-8 files.

In this SO answer it states:

In a nutshell the more interesting part for you:

  1. std::stream (stringstream, fstream, cin, cout) has an internal locale object, which matches the value of the global C++ locale at the moment the stream object is created. As std::cin is created long before your code in main is called, it most probably has the classic C locale, no matter what you do afterwards.
  2. You can make sure that a std::stream object has the desired locale by invoking std::stream::imbue(std::locale(your_favorite_locale)).
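
Both points can be checked without any named locale installed. The following is a minimal sketch, not code from the quoted answer: `CommaDecimal`, `Result`, and `demoStreamLocales` are hypothetical names, and the custom facet exists only to make a locale that visibly differs from the classic one.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: a locale built with it prints ',' as the decimal point.
struct CommaDecimal : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Hypothetical holder for the three formatted strings produced below.
struct Result {
    std::string before;  // stream created before the global locale changed
    std::string after;   // stream created after the global locale changed
    std::string imbued;  // first stream again, after an explicit imbue()
};

Result demoStreamLocales() {
    // Start from the classic locale and remember whatever was global before.
    const std::locale saved = std::locale::global(std::locale::classic());

    std::ostringstream before;  // copies the global locale at construction

    const std::locale custom(std::locale::classic(), new CommaDecimal);
    std::locale::global(custom);  // does not touch streams that already exist

    std::ostringstream after;   // copies the new global locale

    Result r;
    before << 1.5;
    r.before = before.str();
    after << 1.5;
    r.after = after.str();

    before.str("");
    before.imbue(custom);       // explicit per-stream switch
    before << 1.5;
    r.imbued = before.str();

    std::locale::global(saved); // restore the previous global locale
    return r;
}
```

Under these assumptions, `before` keeps formatting with the classic locale until it is imbued, which is the same reason a stream object created once and reused keeps the locale it started with.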

The problem is that, of the two kinds of files, only those that match the locale that was created first are processed correctly. For example, if locale_DE_ISO885915 precedes locale_DE_UTF8, then the UTF-8 files are not appended correctly to the string s, and when I print them with cout I see only a couple of lines from each file.

void processFiles() {
    //setup locales for file decoding
    std::locale locale_DE_ISO885915("de_DE.iso885915@euro");
    std::locale locale_DE_UTF8("de_DE.UTF-8");
    //std::locale::global(locale_DE_ISO885915);
    //std::cout.imbue(std::locale());
    const std::ctype<wchar_t>& facet_DE_ISO885915 = std::use_facet<std::ctype<wchar_t>>(locale_DE_ISO885915);
    //std::locale::global(locale_DE_UTF8);
    //std::cout.imbue(std::locale());
    const std::ctype<wchar_t>& facet_DE_UTF8 = std::use_facet<std::ctype<wchar_t>>(locale_DE_UTF8);

    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    std::string currFile, fileStr;
    std::wifstream inFile;
    std::wstring s;

    for (std::vector<std::string>::const_iterator fci = files.begin(); fci != files.end(); ++fci) {
        currFile = *fci;

        //check file and set locale
        if (currFile.find("-8.txt") != std::string::npos) {
            std::locale::global(locale_DE_ISO885915);
            std::cout.imbue(locale_DE_ISO885915);
        }
        else {
            std::locale::global(locale_DE_UTF8);
            std::cout.imbue(locale_DE_UTF8);
        }

        inFile.open(path + currFile, std::ios_base::binary);
        if (!inFile) {
            //TODO specific file report
            std::cerr << "Failed to open file " << *fci << std::endl;
            exit(1);
        }

        s.clear();
        //read file content
        std::wstring line;
        while( (inFile.good()) && std::getline(inFile, line) ) {
            s.append(line + L"\n");
        }
        inFile.close();

        //remove punctuation, numbers, tolower...
        for (unsigned int i = 0; i < s.length(); ++i) {
            if (ispunct(s[i]) || isdigit(s[i]))
                s[i] = L' ';
        }

        if (currFile.find("-8.txt") != std::string::npos) {
            facet_DE_ISO885915.tolower(&s[0], &s[0] + s.size());
        }
        else {
            facet_DE_UTF8.tolower(&s[0], &s[0] + s.size());
        }
        fileStr = converter.to_bytes(s);


        std::cout << fileStr << std::endl;
        std::cout << currFile << std::endl;
        std::cout << fileStr.size() << std::endl;
        std::cout << std::setlocale(LC_ALL, NULL) << std::endl;
        std::cout << "========================================================================================" << std::endl;
        // Process...
    }
    return;
}

As you can see in the code, I have tried both setting the global locale and using local locale objects, but to no avail.

In addition, an answer to the SO question "How can I use std::imbue to set the locale for std::wcout?" states:

So it really looks like there was an underlying C library mechanism that should be first enabled with setlocale to allow imbue conversion to work correctly.

Is this "obscure" mechanism the problem here?

Is it possible to alternate between the two locales while processing the files? What should I imbue (cout, ifstream, getline?) and how?

Any suggestions?

PS: Why is everything related with locale so chaotic? :|

BugShotGG
  • "Why is everything related with locale so chaotic?" [Overengineering](https://en.wikipedia.org/wiki/Overengineering) at its finest. – Eljay Apr 15 '18 at 11:18
  • @Eljay Somehow, there must be a workaround for such a trivial task... – BugShotGG Apr 15 '18 at 14:41
  • I'll see if I can get your code to work, but truth be told I gave up with C++ and did my "multiple text files in a wide variety of encodings" work in Python 3. – Eljay Apr 15 '18 at 14:43
  • @Eljay I could use python to transform every file to utf8 encoding and then process files in c++ as utf8, but that would be a bit verbose way of solving the problem – BugShotGG Apr 15 '18 at 15:43
  • The easiest way would be to forget that locales ever exist in C++ and use a third party library such as libiconv. – n. m. could be an AI Apr 15 '18 at 15:53
  • @n.m. Or maybe ICU but I am limited in STL for now... – BugShotGG Apr 15 '18 at 16:05

1 Answer


This works for me as expected on my Linux machine, but not on my Windows machine under Cygwin (the set of available locales is apparently the same on both machines, but std::locale::locale just fails with every imaginable locale string).

#include <exception>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

void printFile(const char* name, const char* loc)
{
  try {
    std::wifstream inFile;
    // Imbue before open() so the file is decoded with this locale from the first byte.
    inFile.imbue(std::locale(loc));  // throws std::runtime_error if loc is unknown
    inFile.open(name);
    std::wstring line;
    while (std::getline(inFile, line))
      std::wcout << line << L'\n';
  } catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
  }
}

int main()
{
  std::locale::global(std::locale("en_US.utf8"));

  printFile ("gtext-u8.txt", "de_DE.utf8");       // utf-8 text: grüßen
  printFile ("gtext-legacy.txt", "de_DE@euro");   // iso8859-15 text: grüßen
}

Output:

grüßen
grüßen
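
Since locale names differ between platforms, the same idea can be wrapped in two small helpers. This is only a sketch under stated assumptions: `firstAvailableLocale` and `readFileAs` are hypothetical names that do not appear in the code above, and which locale names actually exist depends on the system.

```cpp
#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns the first locale in `candidates` that can be
// constructed on this system. Throws std::runtime_error if none can.
std::locale firstAvailableLocale(const std::vector<std::string>& candidates) {
    for (const std::string& name : candidates) {
        try {
            return std::locale(name);
        } catch (const std::runtime_error&) {
            // This spelling is unknown here; try the next one.
        }
    }
    throw std::runtime_error("none of the candidate locales is available");
}

// Hypothetical helper: reads a whole file into a wide string, decoding its
// bytes with `loc`. Each line read is terminated with L'\n' in the result.
std::wstring readFileAs(const std::string& path, const std::locale& loc) {
    std::wifstream in;
    in.imbue(loc);   // set the locale before open(), as in printFile
    in.open(path);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::wstring content;
    std::wstring line;
    while (std::getline(in, line)) {
        content += line;
        content += L'\n';
    }
    return content;
}
```

A call such as `readFileAs("gtext-legacy.txt", firstAvailableLocale({"de_DE.ISO8859-15", "de_DE.iso885915@euro", "de_DE@euro"}))` would then try the spellings mentioned on this page in order. Because every call uses a fresh stream with its own locale, the global locale never has to change between files.
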
n. m. could be an AI
  • On my platform, the locales were `"en_US.UTF-8", "de_DE.UTF-8", "de_DE.ISO8859-15"`, but your code worked for me. – Eljay Apr 15 '18 at 19:39
  • @n.m. Thank you for answering. Reading about locales in `The C++ programming language` along with your example made me realise how to treat locales. Using a function to handle scope was also nice. Now I can see why some people said `std::locale::global();` is really bad. It may be useful only for `std::cout`. – BugShotGG Apr 15 '18 at 20:30
  • @Nik-Lz the [Ask Question](https://stackoverflow.com/questions/ask) button is up there near the top of the page. – n. m. could be an AI Aug 12 '18 at 09:06
  • I must have a weird setup as for me this program compiles but does not give any output. As far as I can see there is no output after the `std::locale::global(std::locale("en_US.utf8"));` – albert Feb 21 '21 at 16:48
  • @albert Yes, Windows is weird. – n. m. could be an AI Feb 21 '21 at 20:32