
I need to compare data collected from various sources, some of which contain non-ASCII characters, specifically Latin letters with accents. An example is "Frédérik Gauthier", for which my program prints the non-ASCII bytes as -61, -87, -61, -87. When I looked at the int values of the chars, I noticed that each accented letter is always a pair of two chars: -61, which seems to indicate that the letter is accented, followed by a value for the letter itself, in this case -87 for the accented 'e'. My goal is to just "drop" the accent and use the plain English letter. Obviously, I can't rely on this behavior from system to system, so how do you handle this situation? std::string handles the characters without issue, but as soon as I get down to the char level, the problems start. Any guidance?
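
The two values are consistent with UTF-8 input: 'é' (U+00E9) is encoded there as the bytes 0xC3 0xA9, which show up as -61 and -87 on platforms where `char` is signed. A minimal sketch for inspecting the bytes without the sign confusion, assuming the data really is UTF-8 (the helper name `byte_values` is made up for illustration):

```cpp
#include <string_view>
#include <vector>

// Hypothetical helper: return every byte of s as an unsigned value (0..255),
// so the result does not depend on whether plain char is signed.
std::vector<int> byte_values(std::string_view s) {
    std::vector<int> out;
    out.reserve(s.size());
    for (char c : s) {
        out.push_back(static_cast<unsigned char>(c));
    }
    return out;
}
```

Called on the UTF-8 bytes of "é", this gives 195 and 169 (0xC3 and 0xA9), the same bits as the -61 and -87 seen above.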

#include <iostream>
#include <fstream>
#include <algorithm>
#include <string>
#include <ctype.h>  // isascii (POSIX, not standard C++)

int main(int argc, char** argv){
    std::fstream fin;
    std::string line;
    std::string::iterator it;
    bool leave = false;
    fin.open(argv[1], std::ios::in);

    while(getline(fin, line)){
        std::for_each(line.begin(), line.end(), [](char &a){
            if(!isascii(a)) {
                if(int(a) == -68) a = 'u';
                else if(int(a) == -74) a = 'o';
                else if(int(a) == -83) a = 'i';
                else if(int(a) == -85) a = 'e';
                else if(int(a) == -87) a = 'e';
                else if(int(a) == -91) a = 'a';
                else if(int(a) == -92) a = 'a';
                else if(int(a) == -95) a = 'a';
                else if(int(a) == -120) a = 'n';
            }
        });
        it = line.begin();
        while(it != line.end()){
            it = std::find_if(line.begin(), line.end(), [](char &a){ return !isascii(a); });
            if(it != line.end()){
                line.erase(it);
                it = line.begin();
            }
        }
        std::cout << line << std::endl;
        std::for_each(line.begin(), line.end(), [&leave](char &a){
            if(!isascii(a)) {
                std::cout << a << " : " << int(a);
            }
        });
        if(leave){
            fin.close();
            return 1;
        }
    }
    fin.close();
    return 0;
}
TriHard8
  • 1
    Unrelated to your problem, but I recommend that you use [`std::transform`](https://en.cppreference.com/w/cpp/algorithm/transform) to *transform* the contents of a container, instead of your `std::for_each` call. Semantics, maybe, but proper semantics make the code easier to read and understand, and therefore also easier to maintain. – Some programmer dude Oct 19 '19 at 02:29
  • `std::string` doesn't handle any characters, it's just a container for a `char` array with some sugar. It doesn't know and doesn't care what data it contains, and can store text of any encoding. For us to be able to answer your question, we need to know what encoding the input text has (probably UTF-8) and what encoding your output text is supposed to have (probably ANSI). However there are plenty of questions already on Stack Overflow dealing with character encoding conversion, so your question is very likely a duplicate of whichever question fits your specific encodings. – Max Vollmer Oct 19 '19 at 02:42
  • These questions might be helpful for you: [Converting character encoding within c++](https://stackoverflow.com/questions/18506588/converting-character-encoding-within-c) and [How to convert from UTF-8 to ANSI using standard c++](https://stackoverflow.com/questions/17562736/how-to-convert-from-utf-8-to-ansi-using-standard-c) and [Does C++ support converting between character encodings other than UTF-8, UTF-16, and UTF-32?](https://stackoverflow.com/questions/24563521/does-c-support-converting-between-character-encodings-other-than-utf-8-utf-16) But there are plenty more. – Max Vollmer Oct 19 '19 at 02:44
  • In your specific case I would recommend figuring out the input encoding and implementing a step that converts that to UTF-32. Then you can simply loop over the result as `uint32_t` array, where each `uint32_t` is a single character (no need to deal with complicated variable size characters), cast all that are ASCII to a char, and for those that aren't ASCII have a lookup-table for things like `é` -> `e`. – Max Vollmer Oct 19 '19 at 02:47
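
The approach from the last comment (decode to code points, keep ASCII, map everything else through a lookup table) could be sketched roughly as follows. This assumes the input is UTF-8; the function names `decode_utf8` and `strip_accents` and the small table are illustrative only, and the decoder is simplified (it does not reject overlong encodings or surrogates).

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Returns U+FFFD for malformed input. Always consumes at least one byte.
char32_t decode_utf8(std::string_view s, std::size_t& i) {
    auto byte = [&](std::size_t k) { return static_cast<unsigned char>(s[k]); };
    unsigned char b0 = byte(i++);
    if (b0 < 0x80) return b0;                      // plain ASCII
    int extra = 0;
    char32_t cp = 0;
    if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return 0xFFFD;                            // stray continuation byte
    for (; extra > 0; --extra) {
        if (i >= s.size() || (byte(i) & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (byte(i++) & 0x3F);
    }
    return cp;
}

// Keep ASCII as-is, replace known accented letters, drop everything else.
std::string strip_accents(std::string_view utf8) {
    // Illustrative table only; a real one would cover far more code points.
    static const std::unordered_map<char32_t, char> table = {
        {0x00E1, 'a'}, {0x00E4, 'a'}, {0x00E5, 'a'},  // á ä å
        {0x00E9, 'e'}, {0x00EB, 'e'},                 // é ë
        {0x00ED, 'i'}, {0x00F1, 'n'},                 // í ñ
        {0x00F6, 'o'}, {0x00FC, 'u'},                 // ö ü
    };
    std::string out;
    for (std::size_t i = 0; i < utf8.size();) {
        char32_t cp = decode_utf8(utf8, i);
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (auto it = table.find(cp); it != table.end()) {
            out.push_back(it->second);
        }
        // Unknown non-ASCII code points are silently dropped here.
    }
    return out;
}
```

Working on whole code points means the lookup never sees half of a multi-byte character, which is what goes wrong when each `char` is tested on its own. For anything beyond a handful of letters, a library is likely a better fit than a hand-written table.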

1 Answer


This is a tricky task in general, and you'll probably need to adapt the solution to your particular data. To transliterate a string from whatever encoding it's in to ASCII, it's best to rely on a library instead of trying to implement this yourself. Here's an example using iconv (note that the exact output of //TRANSLIT depends on the iconv implementation and, with glibc, on the current locale):

#include <iconv.h>
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cerrno>
#include <iostream>
#include <memory>
#include <string>
#include <string_view>
#include <system_error>
#include <type_traits>
using namespace std;

// u8"..." literals are char8_t-based in C++20; copy the bytes into a plain string.
string from_u8string(const u8string &s) {
  return string(s.begin(), s.end());
}

// RAII wrapper so that iconv_close is called automatically.
using iconv_handle = unique_ptr<remove_pointer<iconv_t>::type, decltype(&iconv_close)>;
iconv_handle make_converter(const string& to, const string& from) {
    // iconv_open needs null-terminated names, hence std::string and c_str().
    auto raw_converter = iconv_open(to.c_str(), from.c_str());
    if (raw_converter != (iconv_t)-1) {
        return { raw_converter, iconv_close };
    } else {
        throw std::system_error(errno, std::system_category());
    }
}

string convert_to_ascii(string input, const string& encoding) {
    iconv_handle converter = make_converter("ASCII//TRANSLIT", encoding);

    char* input_data = input.data();
    size_t input_size = input.size();

    // Transliteration can expand a character into several ASCII characters.
    // Twice the input size is a guess; iconv fails with E2BIG if it is too small.
    string output;
    output.resize(input_size * 2);
    char* converted = output.data();
    size_t converted_size = output.size();

    auto chars_converted = iconv(converter.get(), &input_data, &input_size, &converted, &converted_size);
    if (chars_converted != (size_t)(-1)) {
        // converted_size now holds the unused space; trim it off.
        output.resize(output.size() - converted_size);
        return output;
    } else {
        throw std::system_error(errno, std::system_category());
    }
}

string convert_to_plain_ascii(string_view input, const string& encoding) {
    auto converted = convert_to_ascii(string{ input }, encoding);
    // Drop everything that is not a letter (this also removes spaces, digits,
    // and any punctuation the transliteration may have produced).
    converted.erase(
        std::remove_if(converted.begin(), converted.end(),
                       [](unsigned char c) { return !isalpha(c); }),
        converted.end()
    );
    return converted;
}

int main() {
    try {
        auto converted_utf8 = convert_to_plain_ascii(from_u8string(u8"Frédérik"), "UTF-8");
        assert(converted_utf8 == "Frederik");
        // Spell out the windows-1252 bytes so the test does not depend on
        // the encoding of this source file (0xE9 is 'é' in windows-1252).
        auto converted_1252 = convert_to_plain_ascii("Fr\xE9" "d\xE9rik", "windows-1252");
        assert(converted_1252 == "Frederik");
    } catch (std::system_error& e) {
        cout << "Error " << e.code() << ": " << e.what() << endl;
    }
}
Ayjay