Reverse Engineering a Programming Language or 'Unsupervised Learning of Languages'

Question

I need to build a "translator" (is cross-compiler the right word?) from TradeStation's EasyLanguage to C++. However, I could not find any complete documentation of the EasyLanguage grammar.

As a more general question: given a set of valid programs in some language 'A', is it possible to infer a grammar for 'A' if we know (or even if we don't know) that certain basic tokens such as 'if', 'else', and other reserved words exist? Or is this one of those unsolved, case-specific (hard?) problems?

Are there any useful tools I can use to start?

On the specific EasyLanguage issue, does this not meet your needs? http://www.lombardreport.it/uploads/dispense/manuale.pdf "EasyLanguage enables you to use functions residing in dynamic-link libraries (written in C or C++) in your trading signals, analysis techniques, and functions. This means that in addition to all the EasyLanguage reserved words and functions, you also have at your disposal any function in a DLL that is written in C or C++." — Wudang, Jun 28 '11 at 14:58
At least you have one pretty thorough reference manual (Wudang's link). That's not a bad place to start, even if you have to induce the grammar by hand. — Ira Baxter, Jun 28 '11 at 15:07

score 5 · Accepted Answer · edited May 23 '17 at 12:00

The simple answer is "No".

Any kind of generalization from examples suffers from the basic fact that it is guessing. You may guess that the language has an 'if' token. There's no guarantee that it does, or that it is spelled 'if', or that it has semantics that you understand. You're not going to get an automated tool to induce the grammar for you.

Your best bet is to take all the documents you can get that describe the language and, well, guess at a grammar. Then you build a parser for the grammar, validate it against as big a code base as you can find, and revise. I've done this dozens of times with a wide variety of languages (see my bio).
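The guess, validate, revise loop can be sketched as follows. This is only an illustration under stated assumptions: the tiny grammar, the `Parser` class, the `validate` helper, and the sample snippets in `CORPUS` are all hypothetical, written in an EasyLanguage-like style and not taken from any real documentation or code base.

```python
import re
from collections import Counter

# Hypothetical token pattern for a first guess at the language.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[A-Za-z_]\w*|<=|>=|<>|[-+*/<>=();]")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


class ParseError(Exception):
    pass


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if not match:
            raise ParseError(f"unknown character {text[pos]!r}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


class Parser:
    """Recursive-descent parser for a guessed grammar (an assumption):
    program := stmt*
    stmt    := 'if' expr 'then' stmt | IDENT '=' expr ';'
    expr    := term (OP term)*
    term    := NUMBER | IDENT | '(' expr ')'
    """

    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else "<eof>"

    def expect(self, tok):
        if self.peek() != tok:
            raise ParseError(f"expected {tok!r}, got {self.peek()!r}")
        self.i += 1

    def program(self):
        while self.peek() != "<eof>":
            self.statement()

    def statement(self):
        if self.peek() == "if":
            self.i += 1
            self.expr()
            self.expect("then")
            self.statement()
        else:
            if not IDENT_RE.fullmatch(self.peek()):
                raise ParseError(f"expected identifier, got {self.peek()!r}")
            self.i += 1
            self.expect("=")
            self.expr()
            self.expect(";")

    def expr(self):
        self.term()
        while self.peek() in {"+", "-", "*", "/", "<", ">"}:
            self.i += 1
            self.term()

    def term(self):
        tok = self.peek()
        if tok == "(":
            self.i += 1
            self.expr()
            self.expect(")")
        elif NUMBER_RE.fullmatch(tok) or IDENT_RE.fullmatch(tok):
            self.i += 1
        else:
            raise ParseError(f"unexpected {tok!r} in expression")


def validate(corpus):
    """Parse every sample; return (count parsed OK, Counter of error messages)."""
    ok, failures = 0, Counter()
    for source in corpus.values():
        try:
            Parser(tokenize(source)).program()
            ok += 1
        except ParseError as err:
            failures[str(err)] += 1
    return ok, failures


# Hypothetical sample programs standing in for a real code base.
CORPUS = {
    "a": "x = 1 + 2;",
    "b": "if close > open then x = x + 1;",
    "c": "if x > 1 then begin y = 2; end;",
    "d": "value1 = (high + low) / 2;",
}

if __name__ == "__main__":
    ok, failures = validate(CORPUS)
    print(f"{ok}/{len(CORPUS)} samples parsed")
    for message, count in failures.most_common():
        print(f"{count:3d}  {message}")
```

Sample "c" fails because the guessed grammar has no notion of a begin/end block. Grouping failures by message over a large corpus points at the construct the grammar is most often missing, which tells you what to revise next.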

It is painful, but you often get someplace pretty useful. The good news is that your parser doesn't have to parse anything the users don't know how to write. The bad news is they'll write things based on some obscure example you've never seen, or with a typo that accidentally works. (Even if the language designer didn't intend it, that doesn't matter to the user; his program works and your compiler doesn't. Your problem, by definition.)

What you'll never know is whether the provider of the language has certain features he simply hasn't documented and hasn't shown anyone else. Be continually prepared to be surprised, long after you are done :-{

Now, the best tool you can use for this process IMHO is a GLR parser generator; it is what my company uses. These will parse any context-free language (that you might propose) without a lot of struggle to bend the grammar to match the otherwise common restrictions of recursive descent, LL(k), or LR(k) parsers. Life is hard enough when you have to guess the grammar, let alone guess the grammar and then guess how to bend it to make the parser generator swallow it correctly.
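To illustrate what "without bending the grammar" buys you, here is a minimal sketch of a general context-free recognizer. Note that it uses Earley's algorithm, not GLR; it is swapped in only because it fits in a few lines, and both accept grammars that are left-recursive or ambiguous as written. The function name `earley_recognize` and the expression grammar are illustrative assumptions, and this simplified version assumes no empty right-hand sides.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if `tokens` can be derived from `start`.

    grammar: dict mapping each nonterminal to a list of right-hand sides,
    each a tuple of symbols. Symbols that are not keys are terminals.
    Assumption: no right-hand side is empty.
    """
    # An item is (lhs, rhs, dot position, origin index).
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))

    for i in range(len(tokens) + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in grammar:
                    # Predict: start every alternative of the nonterminal here.
                    for alternative in grammar[symbol]:
                        item = (symbol, alternative, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)
                elif i < len(tokens) and tokens[i] == symbol:
                    # Scan: the terminal matches the next input token.
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Complete: advance every item that was waiting for `lhs`.
                for p_lhs, p_rhs, p_dot, p_origin in list(chart[origin]):
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        item = (p_lhs, p_rhs, p_dot + 1, p_origin)
                        if item not in chart[i]:
                            chart[i].add(item)
                            worklist.append(item)

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[len(tokens)]
    )


# A left-recursive, ambiguous expression grammar, used exactly as guessed.
GRAMMAR = {
    "E": [("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")"), ("n",)],
}

if __name__ == "__main__":
    print(earley_recognize(GRAMMAR, "E", ["n", "+", "n", "*", "n"]))
    print(earley_recognize(GRAMMAR, "E", ["n", "+"]))
```

A plain recursive-descent parser would loop forever on `E -> E + E`, and an LR(k) generator would report conflicts for it. A general parser lets you keep the grammar in the shape you guessed and postpone the ambiguity question until you know more about the language.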

You also have the problem of building a translator, once you get the grammar right. You might find this SO answer helpful: What kinds of patterns could I enforce on the code to make it easier to translate to another programming language?
