I have already opened a related question, but I am still confused, so I would like to be more specific here.
I have numerous files that contain German letters, mostly in ISO-8859-15 or UTF-8 encoding. To process them, I have to convert all letters to lowercase.
For example, I have a file (encoded in ISO-8859-15) that contains:
Dr. Rose in M. Das sogen. Baptisterium zu Winland, eins der im Art. "Baukunst" (S. 496) erwähnten Rundgebäude in Grönland, soll nach Palfreys "History of New England" eine von dem Gouverneur Arnold um 1670 erbaute Windmühle sein. Vgl. Gust. Storm in den "Jahrbüchern der königlichen Gesellschaft für nordische Altertumskunde in Kopenhagen" 1887, S. 296.
Ää Öö Üü ẞß Örebro
The text Ää Öö Üü ẞß Örebro should become: ää öö üü ßß örebro.
However, tolower() does not seem to work on capital letters such as Ä, Ö, Ü, ẞ, even though I tried forcing the locale as mentioned in this SO post.
Here is the same code as posted in my other question:
std::vector<std::string> tokens;
std::string filename = "10223-8.txt";
//std::string filename = "test-UTF8.txt";
std::ifstream inFile;
//std::setlocale(LC_ALL, "en_US.iso88591");
//std::setlocale(LC_ALL, "de_DE.iso88591");
//std::setlocale(LC_ALL, "en_US.iso88591");
//std::locale::global(std::locale(""));
inFile.open(filename);
if (!inFile) { std::cerr << "Failed to open file" << std::endl; exit(1); }
std::string s = "";
std::string line;
while( (inFile.good()) && std::getline(inFile, line) ) {
    s.append(line + "\n");
}
inFile.close();
std::cout << s << std::endl;
//std::setlocale(LC_ALL, "de_DE.iso88591");
for (unsigned int i = 0; i < s.length(); ++i) {
    if (std::ispunct(s[i]) || std::isdigit(s[i]))
        s[i] = ' ';
    if (std::isupper(s[i]))
        s[i] = std::tolower(s[i]);
    //s[i] = std::tolower(s[i]);
    //s[i] = std::tolower(s[i], std::locale("de_DE.utf8"))
}
std::cout << s << std::endl;
//tokenize string
std::istringstream iss(s);
tokens.clear();
tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
//PROCESS TOKENS...
It's really frustrating, and there are not many examples of how to use <locale>.
So, apart from the main problem with my code, here are some questions:
- Do I have to apply some sort of custom locale in other functions too (isupper(), ispunct(), ...)?
- Do I need the de_DE locale enabled or installed in my Linux environment to process the string's chars correctly?
- Is it safe to process, in the same way, text as std::string that is extracted from files with different encodings (ISO-8859-15 or UTF-8)?
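Regarding the second question, one way to find out whether a locale is actually available is to construct it explicitly, because the std::locale constructor throws std::runtime_error when the named locale is not installed. Below is a minimal sketch of that check. The helper name lowercase_with_locale is hypothetical, and the locale name passed to it (for example de_DE.ISO-8859-15) is an assumption that depends on what locale -a lists on the system.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: lowercases s in place using the ctype facet of the
// named locale. Returns false and leaves s untouched if the locale cannot
// be constructed (i.e. it is not available on this system).
bool lowercase_with_locale(std::string& s, const char* locale_name) {
    try {
        const std::locale loc(locale_name);  // throws std::runtime_error if unavailable
        const auto& facet = std::use_facet<std::ctype<char>>(loc);
        if (!s.empty()) {
            // Range overload of ctype<char>::tolower converts [begin, end) in place.
            facet.tolower(&s[0], &s[0] + s.size());
        }
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Calling it as lowercase_with_locale(s, "de_DE.ISO-8859-15") would then show right away whether the locale is missing, instead of silently falling back to the "C" locale. Because the facet works on single bytes, this only makes sense for single-byte encodings such as ISO-8859-15, not for UTF-8.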
EDIT: Konrad Rudolph's answer works fine, but only for UTF-8 files. It does not work for ISO-8859-15, which brings me back to the initial problem posted here: How to apply functions on text files with different encoding in c++
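To illustrate the behaviour I expect for the ISO-8859-15 files, here is a minimal locale-independent sketch that maps bytes directly. It assumes the input is already known to be ISO-8859-15. The names latin9_tolower and latin9_lowercase are hypothetical helpers, not library functions. Note that ẞ (U+1E9E) has no code point in ISO-8859-15, so it cannot occur in such a file, and ß (0xDF) is already lowercase.

```cpp
#include <string>

// Hypothetical helper: lowercases a single ISO-8859-15 (Latin-9) byte.
unsigned char latin9_tolower(unsigned char c) {
    // ASCII letters A..Z
    if (c >= 'A' && c <= 'Z') return static_cast<unsigned char>(c + 0x20);
    // Accented capitals 0xC0..0xDE (À..Þ), skipping the multiplication sign 0xD7
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return static_cast<unsigned char>(c + 0x20);
    // Letters that ISO-8859-15 places outside the 0xC0..0xDE block
    switch (c) {
        case 0xA6: return 0xA8;  // Š -> š
        case 0xB4: return 0xB8;  // Ž -> ž
        case 0xBC: return 0xBD;  // Œ -> œ
        case 0xBE: return 0xFF;  // Ÿ -> ÿ
        default:   return c;     // everything else is unchanged
    }
}

// Hypothetical helper: returns a lowercased copy of an ISO-8859-15 string.
std::string latin9_lowercase(std::string s) {
    for (char& ch : s) {
        // Go through unsigned char so bytes above 0x7F are not treated as negative.
        ch = static_cast<char>(latin9_tolower(static_cast<unsigned char>(ch)));
    }
    return s;
}
```

This must not be applied to the UTF-8 files, because there the same byte values are parts of multi-byte sequences and would be corrupted.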