I am still a bit confused after opening my previous question, so I would like to be more specific here.

I have numerous files that contain German letters, mostly in iso-8859-15 or UTF-8 encoding. To process them, I must convert all letters to lowercase.

For example, I have a file (encoded in iso-8859-15) that contains:

Dr. Rose in M. Das sogen. Baptisterium zu Winland, eins der im Art. "Baukunst" (S. 496) erwähnten Rundgebäude in Grönland, soll nach Palfreys "History of New England" eine von dem Gouverneur Arnold um 1670 erbaute Windmühle sein. Vgl. Gust. Storm in den "Jahrbüchern der königlichen Gesellschaft für nordische Altertumskunde in Kopenhagen" 1887, S. 296.

Ää Öö Üü ẞß Örebro

The text Ää Öö Üü ẞß Örebro should become ää öö üü ßß örebro.

However, tolower() does not seem to work on capital letters such as Ä, Ö, Ü, ẞ, even though I tried forcing the locale as mentioned in this SO post.

Here is the same code as posted in my other question:

std::vector<std::string> tokens;
std::string filename = "10223-8.txt";
//std::string filename = "test-UTF8.txt";
std::ifstream inFile;

//std::setlocale(LC_ALL, "en_US.iso88591");
//std::setlocale(LC_ALL, "de_DE.iso88591");
//std::setlocale(LC_ALL, "en_US.iso88591");
//std::locale::global(std::locale(""));

inFile.open(filename);
if (!inFile) { std::cerr << "Failed to open file" << std::endl; exit(1); }

std::string s = "";
std::string line;
while( (inFile.good()) && std::getline(inFile, line) ) {
    s.append(line + "\n");
}
inFile.close();

std::cout << s << std::endl;

//std::setlocale(LC_ALL, "de_DE.iso88591");
for (unsigned int i = 0; i < s.length(); ++i) {
    if (std::ispunct(s[i]) || std::isdigit(s[i]))
            s[i] = ' ';
    if (std::isupper(s[i]))
            s[i] = std::tolower(s[i]);
            //s[i] = std::tolower(s[i]);
            //s[i] = std::tolower(s[i], std::locale("de_DE.utf8"))
}

std::cout << s << std::endl;

//tokenize string
std::istringstream iss(s);
tokens.clear();
tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};

//PROCESS TOKENS...

It's really frustrating, and there are not many examples of how to use <locale>.

So, apart from the main problem with my code, here are some questions:

  1. Do I have to apply some sort of custom locale in other functions too (isupper(), ispunct()...)?
  2. Do I need the de_DE locale enabled or installed in my Linux environment to process a string's chars correctly?
  3. Is it safe to process, in the same way, text as std::string that is extracted from files with different encoding (iso-8859-15 or UTF-8)?

EDIT: Konrad Rudolph's answer works fine, but only for UTF-8 files. It does not work for iso-8859-15, which brings us back to the initial problem posted here: How to apply functions on text files with different encoding in c++

BugShotGG
  • You can't; some letters don't have a lowercase representation. Some do but don't round-trip. – Richard Critten Apr 08 '18 at 14:49
  • I've had best luck making my own translation tables. If I were you, I'd translate ISO 8859-15 into Unicode. In your code, you seem to have tried to use ISO 8859-1 in various permutations. ISO 8859-15 is not ISO 8859-1. – Eljay Apr 08 '18 at 14:53
  • @RichardCritten Do the ones that I mention have both upper-lower case? – BugShotGG Apr 08 '18 at 14:53
  • @Eljay I was testing other text files too that were in `ISO 8859-1`. It's a mess... – BugShotGG Apr 08 '18 at 14:54
  • C++ as a standard sort of punts when it comes to encodings. The `std::locale` is at the mercy of what the operating system provides, and (in my opinion) is a bit janky. I try to stick with Unicode; UTF-8 mostly, sometimes WTF-8, UTF-16, or UTF-32 but I convert to UTF-8 as fast as I can. Any other encodings I also translate into Unicode (UTF-8) as near the edge as I can. I've had good experiences using IBM's ICU. And, I admit, sometimes I pre-scrub the data using Python 3.x into Unicode. – Eljay Apr 08 '18 at 14:59
  • Where exactly does the idea of passing an extra `std::locale` parameter to `std::tolower()` come from? I am unable to find anything like that documented anywhere. `std::tolower` takes one parameter. The End. – Sam Varshavchik Apr 08 '18 at 15:09
  • @Eljay well from other posts in SO. I was also wondering about it but seems that it works. Note that there is `` and `` – BugShotGG Apr 08 '18 at 19:50
  • Locales might not be set up properly on your system. You should try the solution to the following question: https://stackoverflow.com/questions/19100708 – N00byEdge Apr 08 '18 at 14:54
  • Do you think that I cannot use the `de_DE` locale in a C++ program because it's not installed in my system `locale`? – BugShotGG Apr 08 '18 at 20:04
  • Either that or that `LC_ALL` is not available for setting. – N00byEdge Apr 08 '18 at 20:23
  • I have posted my locale here: https://stackoverflow.com/questions/49705874/how-to-apply-cctype-functions-on-text-files-with-different-encoding-in-c – BugShotGG Apr 08 '18 at 20:26
  • @SamVarshavchik [Cough](http://en.cppreference.com/w/cpp/locale/tolower) – Konrad Rudolph Apr 09 '18 at 14:43
  • @SamVarshavchik Any ideas about the problem? – BugShotGG Apr 09 '18 at 14:45
  • You’re using the wrong `tolower`. Have a look at [the example on cppreference](http://en.cppreference.com/w/cpp/locale/ctype/tolower). But beware; the code there doesn’t work on every system. On my macOS, for instance, it either crashes due to an unknown locale name, or mangles the output result. – Konrad Rudolph Apr 09 '18 at 14:50
  • @KonradRudolph Thanks! Trying to adapt the example from cppreference. Let's hope that it will work... – BugShotGG Apr 09 '18 at 14:58

1 Answer

Use std::ctype::tolower, not std::tolower:

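#include <iostream>
#include <locale>
#include <string>

int main() {
    // Throws std::runtime_error if this locale is not installed.
    std::locale::global(std::locale("de_DE.UTF-8"));
    std::wcout.imbue(std::locale());
    // The ctype facet performs locale-aware, per-character case conversion.
    auto& f = std::use_facet<std::ctype<wchar_t>>(std::locale());
    std::wstring str = L"Ää Öö Üü ẞß Örebro";
    // Lowercase the whole buffer in place.
    f.tolower(&str[0], &str[0] + str.size());
    std::wcout << "'" << str << "'\n";
}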

Rather than setting a global locale, you could also create a local locale (heh):

std::locale loc("de_DE.UTF-8");
std::wcout.imbue(loc);
auto& f = std::use_facet<std::ctype<wchar_t>>(loc);

This compiles and “works”. On my system, it correctly converts the umlauts but it fails to handle the capital-ß (not surprisingly, to be honest).
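Since locale names differ between systems and `std::locale`'s constructor throws `std::runtime_error` for a name it does not know, it may help to probe a few candidate names before using one. The following is only a sketch; `first_available_locale` is a hypothetical helper name, and the candidate names you pass in are whatever `locale -a` suggests on your machine:

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns the first name in `candidates` for which a
// std::locale can be constructed on this system, or an empty string if none
// of them is available.
std::string first_available_locale(const std::vector<std::string>& candidates) {
    for (const auto& name : candidates) {
        try {
            std::locale probe(name);  // throws std::runtime_error if unknown
            return name;
        } catch (const std::runtime_error&) {
            // Not installed (or spelled differently) here; try the next one.
        }
    }
    return std::string();
}
```

You could call it with something like `{"de_DE.UTF-8", "de_DE.utf8"}` and fall back to another strategy when it returns an empty string, instead of letting the program terminate on an uncaught exception.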

Furthermore, note the limitations of this function: it can only perform 1-to-1 character conversions. In previous versions of the Unicode standard, the correct uppercase transformation of “ß” was “SS”. std::ctype::toupper explicitly never supported this.
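For a single-byte encoding such as ISO-8859-15, lowercasing really is a 1-to-1 mapping, so a hand-written table that does not depend on any installed locale is one possible alternative. This is a minimal sketch under that assumption; the names `latin9_tolower` and `latin9_lowercase` are made up for illustration, and the input must already be known to be ISO-8859-15 (it must not be applied to UTF-8 bytes). Note that ISO-8859-15 has no capital ẞ at all, only ß (0xDF), which is already lowercase:

```cpp
#include <string>

// Lowercase one ISO-8859-15 byte using a hand-written mapping.
unsigned char latin9_tolower(unsigned char c) {
    if (c >= 'A' && c <= 'Z') {
        return static_cast<unsigned char>(c + 0x20);  // ASCII letters
    }
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) {
        // À..Þ map to à..þ; 0xD7 is the multiplication sign, not a letter.
        return static_cast<unsigned char>(c + 0x20);
    }
    switch (c) {
        case 0xA6: return 0xA8;  // Š -> š
        case 0xB4: return 0xB8;  // Ž -> ž
        case 0xBC: return 0xBD;  // Œ -> œ
        case 0xBE: return 0xFF;  // Ÿ -> ÿ
        default:   return c;     // everything else, including ß (0xDF)
    }
}

// Lowercase a whole ISO-8859-15 encoded string, byte by byte.
std::string latin9_lowercase(std::string s) {
    for (char& ch : s) {
        ch = static_cast<char>(latin9_tolower(static_cast<unsigned char>(ch)));
    }
    return s;
}
```

The casts to `unsigned char` matter: bytes above 0x7F are negative when `char` is signed, which is also why passing them straight to the `<cctype>` functions is undefined behaviour.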

Konrad Rudolph
  • Thanks. Your code works fine! It even transforms `ẞß` to `ßß`. Now I just have to find a way to make this work with streams and input file string. – BugShotGG Apr 09 '18 at 18:19
  • Btw, the option of creating a local locale does not seem to work. In addition, since there are files that are in iso-8859-15 how would I make the above example work for both UTF-8 and iso-8859-15? Only by knowing file encoding beforehand or is it possible via some other method? – BugShotGG Apr 10 '18 at 05:49
  • @BugShotGG In general, yes, you need to know the file encoding. Of course you can then decide *at runtime* on a per-file basis. If you don’t know how the individual files are encoded (always bad), you can simply try both locales for each file and see which one gives better results (how “better” is defined isn’t trivial though). – Konrad Rudolph Apr 10 '18 at 09:13
  • Before I close the question, I would like to ask you two critical questions. To correctly apply tolower to iso-8859-15 files, should I change the first line to `std::locale::global(std::locale("de_DE.iso-885915"));`? Also, is it safe to convert `wstring` to `string` after the application of `tolower()`? – BugShotGG Apr 13 '18 at 17:04
  • @BugShotGG Unfortunately I don’t know the answer to either question definitely. (1) Depends on the system. On my system I’d have to use `de_DE.ISO8859-15` (see `locale -a` output on Unix) and this would probably not work with `wchar_t`. (2) How do you plan on converting it? Reliable conversion between `std::string` and `std::wstring` in a given encoding is hard in standard C++ without external libraries since neither are encoding-aware. – Konrad Rudolph Apr 13 '18 at 17:10
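One way to sidestep the per-encoding locale question, along the lines of the earlier suggestion to translate ISO 8859-15 into Unicode, is to transcode such files to UTF-8 right after reading them, so that the rest of the program only ever deals with one encoding. Below is a rough sketch that does this by hand without external libraries; `latin9_to_code_point` and `latin9_to_utf8` are illustrative names, and the code assumes you already know the input is ISO-8859-15:

```cpp
#include <string>

// Map one ISO-8859-15 byte to its Unicode code point. Only eight positions
// differ from ISO-8859-1, where the byte value equals the code point.
char32_t latin9_to_code_point(unsigned char c) {
    switch (c) {
        case 0xA4: return 0x20AC;  // €
        case 0xA6: return 0x0160;  // Š
        case 0xA8: return 0x0161;  // š
        case 0xB4: return 0x017D;  // Ž
        case 0xB8: return 0x017E;  // ž
        case 0xBC: return 0x0152;  // Œ
        case 0xBD: return 0x0153;  // œ
        case 0xBE: return 0x0178;  // Ÿ
        default:   return c;
    }
}

// Transcode an ISO-8859-15 byte string to UTF-8.
std::string latin9_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in) {
        const char32_t cp = latin9_to_code_point(c);
        if (cp < 0x80) {            // 1-byte sequence (ASCII)
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2-byte sequence
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 3-byte sequence (only € gets here)
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

After this step the UTF-8 approach from the answer can be applied to every file, whatever its original encoding was. Detecting which encoding a given file uses is still up to you.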