Greek Character Conversion

Question

I am trying to make a simple ancient Greek to modern Greek converter in C, by changing the tones of the vowels. For example, the user types a text in Greek which contains the character ῶ (Unicode: U+1FF6), and the program converts it into ώ (Unicode: U+1F7D). Greek characters are not supported by C, so I don't know how to make it work. Any ideas?

StackOverflow is not a programming service. Show code and where you get stuck. — meaning-matters, Dec 26 '17 at 14:58
I thought that StackOverflow was not only an online debugger. I didn't ask to write a code for me, I just asked for a tip to help me continue. I wrote directly my problem instead of letting you to search errors in an incorrect part of my code. — joe jordishon, Dec 26 '17 at 15:08
Post more of your idea/problem - else this is too broad. If only one vowel substitution is possible an `if()` works fine. If there are dozens or hundreds of case considerations, then other approaches should be used. C supports Unicode. — chux - Reinstate Monica, Dec 26 '17 at 19:28

score 3 · Accepted Answer · answered Dec 28 '17 at 04:58

Assuming you use a sane operating system (meaning, not Windows), this is very easy to achieve using C99/C11 locale and wide character support. Consider filter.c:

#include <stdlib.h>
#include <locale.h>
#include <wchar.h>
#include <stdio.h>

wint_t convert(const wint_t  wc)
{
    switch (wc) {
    case L'ῶ': return L'ώ';
    default:   return wc;
    }
}

int main(void)
{
    wint_t  wc;

    if (!setlocale(LC_ALL, "")) {
        fprintf(stderr, "Current locale is unsupported.\n");
        return EXIT_FAILURE;
    }
    if (fwide(stdin, 1) <= 0) {
        fprintf(stderr, "Standard input does not support wide characters.\n");
        return EXIT_FAILURE;
    }
    if (fwide(stdout, 1) <= 0) {
        fprintf(stderr, "Standard output does not support wide characters.\n");
        return EXIT_FAILURE;
    }

    while ((wc = fgetwc(stdin)) != WEOF)
        fputwc(convert(wc), stdout);

    return EXIT_SUCCESS;
}

The above program reads standard input, converts each ῶ into a ώ, and outputs the result.

Note that wide character strings and characters have an L prefix; L'ῶ' is a wide character constant. These are only in Unicode if the execution character set (the character set the code is compiled for) is Unicode, and that depends on your development environment. (Fortunately, outside of Windows, UTF-8 is pretty much a standard nowadays -- and that is a good thing -- so code like the above Just Works.)

On POSIXy systems (like Linux, Android, Mac OS, BSDs), you can use the iconv() facilities to convert from any input character set to Unicode, do the conversion there, and finally convert back to any output character set. Unfortunately, the question is not tagged posix, so that is outside this particular question.
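As a rough illustration of the first step of that iconv() route, the sketch below decodes a UTF-8 string into Unicode code points. The helper name utf8_to_code_points() is made up for this example, and the encoding names "UTF-8" and "UTF-32LE" are an assumption: POSIX leaves the accepted names to the implementation, although glibc and GNU libiconv both recognize these two. Converting back to the output character set would use a second iconv_open() with the arguments swapped.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode a NUL-terminated UTF-8 string into at most
   max Unicode code points. Returns the number of code points stored,
   or (size_t)-1 if the conversion fails or dst is too small. */
size_t utf8_to_code_points(const char *src, uint32_t *dst, size_t max)
{
    iconv_t  cd = iconv_open("UTF-32LE", "UTF-8");  /* (to, from) */
    char    *in = (char *)src;      /* iconv() only reads through this */
    size_t   inleft = strlen(src);
    char    *out = (char *)dst;
    size_t   outleft = max * sizeof dst[0];
    size_t   rc, n, i;

    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &in, &inleft, &out, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;

    /* Assemble each little-endian group of four bytes into a uint32_t,
       so the result does not depend on the byte order of this machine. */
    n = max - outleft / sizeof dst[0];
    for (i = 0; i < n; i++) {
        const unsigned char *b = (const unsigned char *)&dst[i];
        dst[i] = (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
                 ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
    }
    return n;
}
```

On some systems (for example those using a separate libiconv) the program has to be linked with -liconv.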

The above example uses a simple switch/case statement. If there are many replacement pairs, one could use e.g.

typedef struct {
    wint_t  from;
    wint_t  to;
} widepair;

static widepair  replace[] = {
    { L'ῶ', L'ώ' },
    /* Others? */
};
#define  NUM_REPLACE  (sizeof replace / sizeof replace[0])

and at runtime, sort replace[] (using qsort() and a function that compares the from elements), and use binary search to quickly determine whether a wide character is to be replaced (and if so, with which wide character). Because this is an O(log₂ N) operation, with N being the number of pairs, and it is reasonably cache-friendly, even thousands of replacement pairs are not a problem this way. (And of course, you can build the replacement array at runtime just as well, even from user input or command-line options.)
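A minimal sketch of that sort-then-search approach follows. The names compare_from() and sort_replacements() are invented for this example, and the second and third pairs in the table are only illustrative placeholders, not a vetted polytonic-to-monotonic mapping. The convert() here would take the place of the switch-based one in filter.c.

```c
#include <stdlib.h>
#include <wchar.h>

typedef struct {
    wint_t  from;
    wint_t  to;
} widepair;

/* Illustrative pairs only; deliberately not listed in sorted order. */
static widepair  replace[] = {
    { L'ῶ', L'ώ' },
    { L'ἀ', L'α' },
    { L'ῆ', L'ή' },
};
#define  NUM_REPLACE  (sizeof replace / sizeof replace[0])

/* Order two pairs by their from members; used by qsort() and bsearch(). */
static int compare_from(const void *a, const void *b)
{
    const wint_t  x = ((const widepair *)a)->from;
    const wint_t  y = ((const widepair *)b)->from;
    return (x > y) - (x < y);
}

/* Call once at startup, before the first convert(). */
void sort_replacements(void)
{
    qsort(replace, NUM_REPLACE, sizeof replace[0], compare_from);
}

/* Binary search for wc; characters without an entry pass through unchanged. */
wint_t convert(const wint_t wc)
{
    widepair         key;
    const widepair  *hit;

    key.from = wc;
    key.to = wc;
    hit = bsearch(&key, replace, NUM_REPLACE, sizeof replace[0], compare_from);
    return hit ? hit->to : wc;
}
```

In main(), sort_replacements() would be called once after setlocale(), and the fgetwc()/fputwc() loop stays as it is.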

For Unicode characters, we could use a uint32_t map_to[0x110000]; to directly map each code point to another Unicode code point, but because we do not know whether wide characters are Unicode or not, we cannot do that; we do not know the code range of the wide characters until after compile time. Of course, we can do a multi-stage compilation, where a test program generates the replace[] array shown above, and outputs their codes in decimal; then do some kind of auto-grouping or clustering, for example bit maps or hash tables, to do it "even faster".
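For comparison, here is roughly what the direct map would look like in the case where the input has already been decoded to Unicode code points (that precondition is the assumption here; the wide-character filter above does not guarantee it). The names init_map() and map_code_point() are invented for this sketch, and the single entry uses the code points given in the question.

```c
#include <stddef.h>
#include <stdint.h>

#define  NUM_CODE_POINTS  0x110000

/* One entry per Unicode code point; about 4.25 MiB of static storage. */
static uint32_t  map_to[NUM_CODE_POINTS];

/* Start from the identity mapping, then add the replacements. */
void init_map(void)
{
    uint32_t  cp;

    for (cp = 0; cp < NUM_CODE_POINTS; cp++)
        map_to[cp] = cp;

    map_to[0x1FF6] = 0x1F7D;   /* the pair from the question */
}

/* Constant-time lookup; values outside the Unicode range pass through. */
uint32_t map_code_point(uint32_t cp)
{
    return (cp < NUM_CODE_POINTS) ? map_to[cp] : cp;
}
```

The lookup is a single array access, at the cost of a table that is far larger than the handful of entries actually used.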

However, in practice it usually turns out that the I/O (reading and writing the data) takes more real-world time than the conversion itself. Even when the conversion is the bottleneck, the conversion rate is sufficient for most humans. (As an example, when compiling C or C++ code with the GNU utilities, the preprocessor first converts the source code to UTF-8 internally.)

meaning-matters · Answer 2 · 2017-12-26T16:07:24.637

2

Okay, here's some quick advice. I wouldn't use C because Unicode is not well supported (yet).

A better language choice would be Python, Java, ..., anything with good Unicode support.

I'd write a utility that reads from standard input and writes to standard output. This makes it easy to use from the command line and in scripts.

I might be missing something but it's going to be something like this (in pseudo code):

while ((inCharacter = getCharacterFromStandardInput()) != EOF)
{
    switch (inCharacter)
    {
        case 'ῶ': outCharacter = 'ώ'; break;
        ...
        default:  outCharacter = inCharacter; break;  /* pass everything else through */
    }

    writeCharacterToStandardOutput(outCharacter);
}

You'll also need to select & handle the format: UTF-8/16/32.
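To give an idea of what handling the format involves, here is a small sketch in C of the standard UTF-8 encoding rule for a single code point. The function name utf8_encode() is made up for this example; UTF-16 and UTF-32 would need their own, different routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out[].
   Returns the number of bytes written (1 to 4), or 0 if cp is not a
   valid code point to encode (a surrogate, or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Note that ῶ (U+1FF6) takes three bytes in UTF-8, which is why reading the input one byte at a time is not enough.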

That's it. Good luck!

edited Dec 26 '17 at 16:07

answered Dec 26 '17 at 15:59


1

Unicode code points have variable length in UTF-8 and UTF-16. They cannot be represented as single byte `char`, except a small subset of UTF-8. Also, C is a low level language and can easily handle Unicode, even the versions of C which are older than Unicode. But absent any additional information, your suggestion of picking another language probably fits the best. – Barmak Shemirani Dec 26 '17 at 16:57
