Translating source code into a foreign language

Question

I run an educational website that teaches programming to kids (12-15 years old).

As they don't all speak English, the source code of our solutions uses French variable and function names. However, we are planning to translate the content into other languages (German, Spanish, English). To do so, I would like to translate the source code as fast as possible. We mostly have C/C++ code.

The solution I'm planning to use:

1. Extract all variable/function names from the source code, with their positions in the file (where they are declared, used, called...).
2. Remove all language keywords and library functions.
3. Ask the translator to provide translations for the remaining names.
4. Replace the names in the file.

Is there already an open-source project that can do this (for points 1, 2 and 4)?

If there isn't, the most difficult point is the first one: using a C/C++ parser to build a syntax tree and then extracting the variables with their positions seems the way to go. Do you have other ideas?
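For point 1, a hand-written scanner may be enough to get started. The following is only a minimal sketch, not an existing tool: `Occurrence`, `is_ident_char` and `find_identifiers` are hypothetical names, and it assumes that comments, string literals and preprocessor lines do not need to be skipped yet.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One occurrence of an identifier-like word in the source text.
struct Occurrence {
    std::string name;
    std::size_t line;    // 1-based
    std::size_t column;  // 1-based, a tab counts as one column
};

// Hypothetical helper: letters, digits and '_' can be part of a name.
static bool is_ident_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Collects every identifier-like word together with its position.
// Simplification (assumption): words inside comments, string literals
// and preprocessor lines are reported too; keywords are not filtered.
std::vector<Occurrence> find_identifiers(const std::string& source) {
    std::vector<Occurrence> result;
    std::size_t line = 1, column = 1, i = 0;
    while (i < source.size()) {
        const char c = source[i];
        if (std::isdigit(static_cast<unsigned char>(c))) {
            // Number literal such as 42 or 10L: skip it with its suffix.
            while (i < source.size() && is_ident_char(source[i])) { ++i; ++column; }
        } else if (is_ident_char(c)) {
            const std::size_t start = i;
            while (i < source.size() && is_ident_char(source[i])) ++i;
            result.push_back({source.substr(start, i - start), line, column});
            column += i - start;
        } else {
            if (c == '\n') { ++line; column = 1; } else { ++column; }
            ++i;
        }
    }
    return result;
}
```

Calling `find_identifiers` on a solution gives the list of names and positions that points 2 and 4 can then work from.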

Thank you for any advice.

Edit: As noted in a comment, I will also need to take care of the comments, but there are only a few of them: the complete solution is already explained in plain text, and then we show the source code with self-explanatory variable/function names. The source code is rarely more than 30-40 lines long, and good names must make it understandable without comments if you already know what the code is doing.

Additional info: for those interested, the website is a training platform for the International Olympiad in Informatics, and C/C++ (at least the minimum needed for programming contests) is not so difficult for a 12-year-old to learn.

Try putting the code directly into Google Translate. It does a pretty good job of only translating words. The things it does "accidentally" translate could be dealt with by running the code through something that replaces them with known substitutes. — , Aug 27 '11 at 15:33
Some would question the decision to use C/C++ to teach children that age, but I wrote C when I was 15 and did not suffer any damage, as far as I can tell. (Writing Pascal at an earlier age harmed me more, because I didn't have any pronunciation guide for the many keywords there. Took me years to stop pronouncing "begin" as if it was the Israeli prime minister). — hmakholm left over Monica, Aug 27 '11 at 15:39
Some variable names can easily be translated by automated tools, for example (French -> English): "longueurMax" -> "maxLength". But we want an (almost) perfect translation, so the variable names can't be translated automatically; they must be chosen by the translator-programmer. — Loïc Février, Aug 27 '11 at 15:41
I don't think this is a good idea. A correct foreign word is better than an incorrect native word. The translator will have 0 context to go by when translating. Many words have homonyms, how would that be resolved? I would not translate the source, I would leave it as is. Also, 12-15 year olds are all already learning English at school. — JRL, Aug 27 '11 at 15:51
@JRL: in our experience it is causing problems with kids (in France...). Also the translator will have the context (the original source code), the semi-automated translation is just here to help him not to forget anything. — Loïc Février, Aug 27 '11 at 15:57
@Loic: source code is not context for a translator, unless he/she is a programmer. — JRL, Aug 27 '11 at 15:59
You are doing your students enough of a disservice by teaching them either C or C++ or both; it is far worse to pretend that they constitute a single hybrid language "C/C++". Idiomatic, well-written C is worlds apart from idiomatic, well-written C++. — Karl Knechtel, Aug 27 '11 at 21:52
@Karl Knechtel : I used C/C++ here as a shortcut. In practice one can consider that we are teaching C++ with C I/O (for speed) and with classes limited to struct (+ a few methods like `< operator`). The point is not to be "C" or "C++" but do some C++ with the best of each language and with one goal : to code algorithmic challenges fast, with a short code and without any bug. — Loïc Février, Aug 27 '11 at 22:02
Yes, that's exactly the kind of hybridization I'm talking about. It gets in the way of writing the bug-free, short code you want them to write; and algorithms are about the process, not the implementation. — Karl Knechtel, Aug 27 '11 at 23:43
@Karl Knechtel: I don't agree. Pure C++ is too slow (for the I/O) and they don't need the full object-oriented aspect. The process matters, yes, but when you have tight time/memory constraints the implementation matters too. However, C I/O is not taught to beginners. — Loïc Février, Aug 28 '11 at 00:53
WTF work do you have beginners doing where I/O speed matters? — Karl Knechtel, Aug 28 '11 at 01:32
As I said, we are preparing them for the IOI (ioinformatics.org), and a factor of 2 in I/O speed can be critical (for some problems, not all). However, we only teach this after they've been trained for a few years. Another example: this year at the IOI, one problem specifically stated "the STL is too slow if you want to have full score, don't use it". — Loïc Février, Aug 28 '11 at 12:39
Just out of curiosity, could we have a link to the website? Is it http://ioinformatics.org/index.shtml or something else? (P.S. I am 15. ;)) — Mateen Ulhaq, Nov 08 '11 at 02:49
It's http://www.france-ioi.org but the content has not been translated into English yet; most of it is still in French... — Loïc Février, Nov 08 '11 at 11:50

Shahbaz · Answer 1 · 2011-11-08T00:15:22.750

2

You don't really need a C/C++ parser, just a simple lexer that gives you the elements of the code one by one. You will get a lot of tokens such as {, [, 213, ) etc. that you simply copy unchanged to the result file. You translate whatever consists only of letters (except keywords) and write the translation to the output.

Now that I think about it, it's as simple as this:

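#include <iostream>
#include <string>
using namespace std;

// Placeholder (assumption): a real version would look the name up in the
// translator's dictionary. Here the name is returned unchanged.
string translate(const string &name)
{
    return name;
}
bool is_letter(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
bool is_keyword(const string &s)
{
    return s == "if" || s == "else" || s == "void" /* rest of them */;
}
void translateCode(istream &in, ostream &out)
{
    char c;
    while (in.get(c))  // stops cleanly at end of input
    {
        if (is_letter(c))
        {
            string name;
            bool more;
            do
            {
                name += c;
                more = static_cast<bool>(in.get(c));
            } while (more && is_letter(c));
            if (is_keyword(name))
                out << name;
            else
                out << translate(name);
            if (!more)
                break;  // input ended inside a word: nothing left to write
        }
        out << c;  // even if is_letter(c) was true, the inner loop has read
                   // a new c (not a letter) that has not been written yet,
                   // so it is written here.
    }
}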

I wrote the code in the editor, so there may be minor errors. Tell me if there are any and I'll fix them.

Edit: Explanation:

The code simply reads the input character by character and outputs every non-letter character it reads (including spaces, tabs and newlines). When it sees a letter, it collects all the following letters into one string (until it reaches a non-letter). If the string is a keyword, it outputs the keyword itself. If it is not, it translates the string and outputs the translation.

The output would have the exact same format as the input.
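The code above leaves `translate` undefined. One way to fill it in, as a minimal sketch only: assume the translator's choices are stored in a `std::map`, and copy every word that is not in the map unchanged. The names `Dictionary`, `translate_name` and `translate_source` are hypothetical, and this variant treats digits and underscores as part of a name so that something like `my_variable_45` is looked up as one word.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

// Original name -> name chosen by the translator (assumed input format).
using Dictionary = std::map<std::string, std::string>;

// Returns the translation if there is one, otherwise the word itself.
// Keywords and library names are simply never put in the dictionary.
std::string translate_name(const Dictionary& dict, const std::string& name) {
    const auto it = dict.find(name);
    return it == dict.end() ? name : it->second;
}

static bool is_name_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Copies `source`, replacing each whole word found in the dictionary.
// Simplification (assumption): words inside comments and string
// literals are replaced as well.
std::string translate_source(const Dictionary& dict, const std::string& source) {
    std::string out;
    std::size_t i = 0;
    while (i < source.size()) {
        if (is_name_char(source[i])) {
            const std::size_t start = i;
            while (i < source.size() && is_name_char(source[i])) ++i;
            out += translate_name(dict, source.substr(start, i - start));
        } else {
            out += source[i++];  // punctuation and whitespace are kept as is
        }
    }
    return out;
}
```

With a dictionary that only contains the names the translator has chosen, no keyword list is needed for the replacement step: anything absent from the dictionary is left alone.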

edited Nov 08 '11 at 00:15

answered Aug 27 '11 at 15:44

Shahbaz


void translateCode(istream &in, ostream &out);, remove the final ;. – Aug 27 '11 at 15:46

The list of all keywords is there for cpp: http://en.cppreference.com/w/cpp/keywords and there for c: http://tigcc.ticalc.org/doc/keywords.html - but on top of them you have to take care of all standard symbols such as `cin`, `cout`, `printf`, etc (tons of them) and of the header file names. Nevertheless, this can be a good start – Shlublu Aug 27 '11 at 15:51
Thanks about the ;. Also, good reminder about c/c++ already defined functions and objects @shlubu. It's just a matter of gathering the list of things they have written in their programs that should be excluded from translation and putting it there in the program. – Shahbaz Aug 27 '11 at 16:02
Talking about standard symbols, header file names should also be excluded. – Shahbaz Aug 27 '11 at 16:03
There's a standard C `isalpha` function that you could use instead of your "is_letter". – Mat Aug 27 '11 at 16:06
Thanks. Instead of `is_letter` I will likely need to use something more complex to handle names such as `my_variable_45`. Certainly some regex to extract all the possible identifiers – Loïc Février Aug 27 '11 at 16:14
I was going to write a comment about using `switch` for readability in `is_keyword`, but I read http://stackoverflow.com/questions/650162/why-switch-statement-cannot-be-applied-on-strings and realized that I am a spoiled C# dev. – anthony sottile Aug 27 '11 at 16:41
@Anthony Sottile: haha, you sure are! Although I don't understand why you would need a switch for a simple or statement. It's not like you want to do a different thing based on which keyword you have! – Shahbaz Aug 27 '11 at 17:58
@Loïc Février Depends on how you want to translate. Currently, the translator will get `my`, translate it, then the program reads `_` and prints it after the translation of `my`. Then `variable` is read and translated, and then `_` is printed. Then `4` is read and printed, followed by `5`, resulting in something like `mon_variable_45`. If you are concerned about `mon` versus `ma` (gender), you could add `_` to `is_letter`. Then your translator gets `my_variable_` as input, translates it to `ma_variable_`, and then `4` and `5` are read and printed, resulting in `ma_variable_45` – Shahbaz Aug 27 '11 at 18:01
It would have been merely for aesthetics :) the or statements work perfectly fine. – anthony sottile Aug 28 '11 at 04:03
@Loïc Février did you finally get your answer to this question? – Shahbaz Sep 20 '11 at 09:59

score 2 · Answer 2 · answered Aug 27 '11 at 15:46

Are you sure you need a full syntax tree for this? I think it would be enough to do lexical analysis to find the identifiers, which is much easier. Then exclude keywords and identifiers that also appear in the header files being included.
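As a rough illustration of that lexical analysis, here is a minimal sketch. It assumes the set of names to exclude (keywords plus whatever was gathered from the headers) is already available; `user_identifiers` is a hypothetical name. It skips comments, string and character literals and preprocessor lines, but it does not handle line continuations or raw string literals.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Returns the distinct identifiers of `source` that are not in `excluded`.
std::set<std::string> user_identifiers(const std::string& source,
                                       const std::set<std::string>& excluded) {
    std::set<std::string> names;
    const std::size_t n = source.size();
    std::size_t i = 0;
    bool line_start = true;  // only whitespace seen so far on this line
    while (i < n) {
        const char c = source[i];
        const unsigned char uc = static_cast<unsigned char>(c);
        if (c == '/' && i + 1 < n && source[i + 1] == '/') {
            while (i < n && source[i] != '\n') ++i;       // line comment
        } else if (c == '/' && i + 1 < n && source[i + 1] == '*') {
            const std::size_t end = source.find("*/", i + 2);
            i = (end == std::string::npos) ? n : end + 2;  // block comment
        } else if (c == '"' || c == '\'') {
            ++i;                                           // opening quote
            while (i < n && source[i] != c) {
                if (source[i] == '\\') ++i;                // escaped character
                ++i;
            }
            ++i;                                           // closing quote
            line_start = false;
        } else if (c == '#' && line_start) {
            while (i < n && source[i] != '\n') ++i;       // preprocessor line
        } else if (std::isalpha(uc) || c == '_') {
            const std::size_t start = i;
            while (i < n && (std::isalnum(static_cast<unsigned char>(source[i])) ||
                             source[i] == '_')) ++i;
            const std::string word = source.substr(start, i - start);
            if (excluded.count(word) == 0) names.insert(word);
            line_start = false;
        } else if (std::isdigit(uc)) {
            // Number literal: skip digits and any suffix letters.
            while (i < n && std::isalnum(static_cast<unsigned char>(source[i]))) ++i;
            line_start = false;
        } else {
            if (c == '\n') line_start = true;
            else if (!std::isspace(uc)) line_start = false;
            ++i;
        }
    }
    return names;
}
```

The returned set is the list that would be handed to the translator, together with the original source for context.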

In principle it is possible that you want different variables with the same English name to be translated to different words in French/German -- but for educational use the risk of this arising is probably small enough to ignore at first. You could sidestep the issue by writing the original sources with some disambiguating quasi-Hungarian prefixes and then remove these with the same translation mechanism for display to English-speaking end users.

Be sure to let translators see the name they are translating with full context before they choose a translation.

Indeed, the source files are small (but there are a lot of them) and the variables are always well named; in particular, never the same name for different meanings, no one-letter variables... And yes, the translator will have full context; this translation tool is just here to help him. — Loïc Février, Aug 27 '11 at 16:02

score 2 · Answer 3 · answered Aug 27 '11 at 16:17

I really think you can use clang (libclang) to parse your sources and do what you want (see here for more information). The good news is that it has Python bindings, which will make your life easier if you want to access a translation service or something like that.

score 0 · Answer 4 · answered Nov 08 '11 at 02:44

I don't think replacing identifiers in the code is a good idea.

First, you are not going to get decent translations. A very important point here is that translation (especially automatic or pretty dumb translation) loses and distorts information. You may actually end up with something that's worse than the original.

Second, if the code is meant to be compiled again, the compiler may not be able to compile code containing non-English letters in the translated identifiers.

Third, if you replace identifiers with something else, you need to make sure you don't replace 2 or more different identifiers with the same word. That'll either make the code non-compilable or ruin its logic.

Fourth, you must make sure you don't translate reserved words or identifiers coming from the standard library of the language. Translating those will make the code non-compilable and unreadable. It may not be a trivial task to distinguish the identifiers that the programmer has defined from those provided by the language and its standard library.
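Some of these points can be checked mechanically before any replacement is done. The following is a minimal sketch under stated assumptions: the translations are given as a `std::map` from original name to translated name, the reserved names are given as a set, and translated names are restricted to plain ASCII as a conservative choice (some compilers accept more). `is_ascii_identifier` and `check_dictionary` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// True if `s` is a non-empty identifier made of ASCII letters, digits
// and '_' that does not start with a digit.
static bool is_ascii_identifier(const std::string& s) {
    if (s.empty()) return false;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char c = s[i];
        const bool letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
        const bool digit = c >= '0' && c <= '9';
        if (!(letter || (digit && i > 0))) return false;
    }
    return true;
}

// Returns one message per problem found in the translation dictionary.
std::vector<std::string> check_dictionary(
        const std::map<std::string, std::string>& dict,
        const std::set<std::string>& reserved) {
    std::vector<std::string> problems;
    std::map<std::string, std::string> seen;  // translation -> original name
    for (const auto& entry : dict) {
        const std::string& original = entry.first;
        const std::string& translated = entry.second;
        if (!is_ascii_identifier(translated))
            problems.push_back(original + ": '" + translated + "' is not an ASCII identifier");
        if (reserved.count(translated) != 0)
            problems.push_back(original + ": '" + translated + "' is a reserved or library name");
        const auto inserted = seen.insert({translated, original});
        if (!inserted.second)  // another name already uses this translation
            problems.push_back(original + " and " + inserted.first->second +
                               " both become '" + translated + "'");
    }
    return problems;
}
```

An empty result does not say anything about the quality of the translation; it only rules out the compile-breaking cases listed above.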

What I'd do instead of replacing identifiers with their translations is to provide the translations as comments next to them, for example:

#include <stdio.h>   /* printf */
#include <stdlib.h>  /* exit */

void eat/*comer*/(int* food/*comida*/)
{
  if (*food/*comida*/ <= 0)
  {
    printf("nothing to eat!"/*no hay que comer!*/);
    exit/*salir*/(-1);
  }
  (*food/*comida*/)--;
}

This way you lose no information due to incorrect translation and don't break the code.
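Adding such comments can be automated in the same way as a replacement. This is a minimal sketch, assuming the translations are given as a `std::map` from identifier to translated word; `annotate` and `is_name_char` are hypothetical names. It does not skip string literals or existing comments, and it assumes no translation contains `*/`.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>

static bool is_name_char(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// Copies `source` and appends /*translation*/ after every whole word
// that has an entry in `translations`. The code itself is not changed.
std::string annotate(const std::map<std::string, std::string>& translations,
                     const std::string& source) {
    std::string out;
    std::size_t i = 0;
    while (i < source.size()) {
        if (is_name_char(source[i])) {
            const std::size_t start = i;
            while (i < source.size() && is_name_char(source[i])) ++i;
            const std::string word = source.substr(start, i - start);
            out += word;
            const auto it = translations.find(word);
            if (it != translations.end()) out += "/*" + it->second + "*/";
        } else {
            out += source[i++];  // punctuation and whitespace are kept as is
        }
    }
    return out;
}
```

Since only comments are added, the annotated file should compile exactly like the original.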
