
I'm looking for a way of converting a wstring into a plain string containing only ASCII characters. Any character that isn't present in ASCII (0-127) should be converted to the closest ASCII character. If there is no similar ASCII character, the character should be omitted.

To illustrate, let's assume the following wide string:

wstring text(L"A naïve man called 晨 was having piña colada and crème brûlée.");

The converted version I'm looking for is this (notice the absence of diacritics):

string("A naive man called  was having pina colada and creme brulee.")

Edit:

Regarding the purpose: I'm writing an application that analyzes English texts. The input files are UTF-8 and may contain special characters. A part of my application uses a library written in C that only understands ASCII. So I need a way of "dumbing down" the text to ASCII without losing too much information.

Regarding the precise requirements: Any character that is a diacritic version of an ASCII character should be converted to that ASCII character; all other characters should be omitted. So ı, ĩ, and î should become i because they are all versions of the small Latin letter i. The character ɩ (iota), on the other hand, while visually similar, is not a version of the small Latin letter i and should thus be omitted.

Daniel Wolf
  • *"Any character that isn't present in ASCII (0-127) should be converted to the closest ASCII character. If there is no similar ASCII character, the character should be omitted."* This does not sound well defined at all. Is † almost t? – Baum mit Augen May 23 '16 at 20:36
  • You just have to define "similar" and "closest". A huge table, perhaps? – Bo Persson May 23 '16 at 20:37
  • *to allow for a wider range of possible solutions* that goes against the site. We want a well defined question that has a narrow scope for answers. As is IMHO this is too broad. – NathanOliver May 23 '16 at 20:48
  • See https://en.wikipedia.org/wiki/List_of_Unicode_characters#Basic_Latin and build a table of conversions. Only the Latin characters need to be converted; any Unicode character above that won't map. – Gregg May 23 '16 at 21:00
  • @NathanOliver: Good point. I've removed my comment and edited the answer. – Daniel Wolf May 23 '16 at 21:05
  • Maybe useful: http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string – mindriot May 23 '16 at 21:11
  • Maybe useful, see the C/C++ library behind this [demo](http://demo.icu-project.org/icu-bin/translit): select `Accents` as sample, `Latin` as Source1 and `ASCII` as target... – Anto Jurković May 23 '16 at 21:26
  • @mindriot how is a Python answer using libraries that aren't available in C++ useful? – Mark Ransom May 23 '16 at 21:31
  • @MarkRansom Because it led me to what I have now posted below as an answer. Python solution → Google "unidecode" → find (basic) C++ port. But there's a reason why I wrote it as a comment rather than an answer. :) – mindriot May 23 '16 at 21:38

2 Answers


On GitHub, there is unidecode-cxx, a (somewhat unfinished) C++ port of node-unidecode, which is in turn a JavaScript port of Perl's Text::Unidecode. The C++ version is a bit rough around the edges, but the example in src/unidecode.cxx can be modified to convert your example string,

A naïve man called 晨 was having piña colada and crème brûlée.

as follows:

A naive man called Chen was having pina colada and creme brulee.

In order to get the code to compile without Gyp (something I've never used and haven't had the time to figure out just now), I had to modify the code somewhat (quick and dirty):

  • Add #include <iostream> to src/unidecode.cxx, and add the following main routine:

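    // Assumes this file is saved as UTF-8, so the literal holds UTF-8 bytes.
    int main() {
      string input_buf = "A naïve man called 晨 was having piña colada and crème brûlée.";
      string output_buf;
      unidecode(&input_buf, &output_buf);  // writes the ASCII transliteration to output_buf
      cout << output_buf << endl;
      return 0;
    }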
    
  • Replace all mentions of NULL in src/data.cxx with nullptr

Then I compiled with

g++ -std=c++11 -o unidecode unidecode.cxx

to get the desired result.

The code looks like a fairly primitive port and could do with some improvements, especially toward more "proper" C++. Internally it uses a statically compiled conversion table, which can probably be adapted if the default mapping does not suit your needs (for example, to omit characters such as 晨 instead of transliterating them to "Chen").
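To illustrate the table-based idea on its own, here is a minimal sketch that does not use unidecode-cxx at all. The function name `fold_to_ascii` and the tiny table are hypothetical and cover only the characters mentioned in the question; a usable table would be much larger. ASCII characters pass through, table entries are replaced by their base letter, and everything else is omitted.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical lookup table (not taken from unidecode-cxx): maps a few
// accented Latin letters to their ASCII base letter.
static const std::unordered_map<wchar_t, char>& fold_table() {
    static const std::unordered_map<wchar_t, char> table = {
        {L'\u00E8', 'e'},  // è
        {L'\u00E9', 'e'},  // é
        {L'\u00EE', 'i'},  // î
        {L'\u00EF', 'i'},  // ï
        {L'\u00F1', 'n'},  // ñ
        {L'\u00FB', 'u'},  // û
        {L'\u0129', 'i'},  // ĩ
        {L'\u0131', 'i'},  // ı
    };
    return table;
}

// Keeps ASCII as-is, replaces characters found in the table,
// and silently drops everything else.
std::string fold_to_ascii(const std::wstring& text) {
    std::string out;
    out.reserve(text.size());
    for (wchar_t ch : text) {
        if (static_cast<unsigned long>(ch) < 128) {
            out.push_back(static_cast<char>(ch));
            continue;
        }
        auto it = fold_table().find(ch);
        if (it != fold_table().end()) {
            out.push_back(it->second);
        }
        // No table entry: the character is omitted.
    }
    return out;
}
```

With this table, the question's example string comes out as "A naive man called  was having pina colada and creme brulee.", with 晨 omitted rather than transliterated.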

mindriot

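A wstring is a string of wchar_t, whose size is implementation-defined: typically 2 bytes on Windows and 4 bytes on Linux and macOS. UTF-8, by contrast, is a variable-length encoding that uses 1 to 4 bytes per code point. So your request is not fully consistent: your input files are UTF-8, but you ask about converting a wstring.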

Assuming you've figured out exactly how the data is stored in your strings, I'd suggest checking out the ICU library for the further conversions.

You can normalize your strings and then remove all diacritics, but that still leaves you with characters from other scripts, such as Greek and Cyrillic. Alternatively, you can use ICU's transliteration feature, which is closer to what you're looking for.

mindriot's solution is more concise, but you still need to convert your wstring to a proper UTF-8 sequence first.
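The wstring-to-UTF-8 step could look roughly like the sketch below. It is only an illustration: `wide_to_utf8` is a hypothetical helper, and it assumes that wchar_t holds UTF-16 code units when it is 2 bytes wide and UTF-32 code points otherwise. It does not validate its input, so lone surrogates or out-of-range values are encoded as they are.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a wstring as UTF-8.
// Assumption: 2-byte wchar_t means UTF-16, wider wchar_t means UTF-32.
std::string wide_to_utf8(const std::wstring& text) {
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(text[i]);

        // On UTF-16 platforms, merge a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF &&
            i + 1 < text.size()) {
            std::uint32_t low = static_cast<std::uint32_t>(text[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;  // the low surrogate has been consumed
            }
        }

        if (cp < 0x80) {            // 1 byte: plain ASCII
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {    // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {  // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {                    // 4 bytes
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The resulting std::string can then be handed to a UTF-8 based routine such as the unidecode function from the other answer.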

Teivaz