2

I wanted to make a simple parser, for a "pseudo code" like language(kept rigid), in Java. A sample pseudo code would be -

//This is a comment
$x1 = readint
$x2 = readint

$dx = $x2 - $x1
#f = $dx / 2

if ($dx > 0)
{
  loop while(#f > 1)
  {
     print(#f)
     #f = #f / 2
  }
}

Note that above code is rigid in that, there can not be more than one statement on a line, integers start with $, floats start with # etc.

To parse such code, first I can use StringTokenizer, and then regular expression, to match integer-variables, float-variables, or Keywords.

Is this approach good? For statements in loop, how can i store expressions, so that i don't have to tokenize in each iteration?

I could think of converting expressions (like #f = #f / 2) to polish notation, and then to store in stack. And in each iteration, while popping operands I could replace value for each variable. But is this efficient enough?

Thanks in advance, for any suggestion.

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
Vinayak Garg
  • 6,518
  • 10
  • 53
  • 80
  • Possible duplicate of [this question](http://stackoverflow.com/questions/821501/parser-generator-for-java-with-the-following-requirements) – nobeh Mar 31 '12 at 17:03

3 Answers3

12

Although I think that it's great that you want to build a parser for a language like this, doing so is much harder than it looks. Parsing is a very well-studied problem and there are many excellent algorithms that you can use, but they are extremely difficult to implement by hand. While you can use tricks like conversions to RPN for smaller examples like parsing expressions, building up a full programming language requires a much more complex set of tricks.

To parse a language of this complexity, you are probably best off using a parser generator rather than trying to write your own by hand. ANTLR and Java CUP are two well-known tools for doing precisely what you're interested in accomplishing, and I would strongly suggest using one of the two of them.

Hope this helps!

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
  • Having implemented some of these "extremely difficult to implement by hand" algorithms (e.g., LL, L(AL)R and GLR, I'd say the effort to implement them and use them for single project IMHO is often less than the effort to build a full parser using weaker methods. Agreed that you should get an off-the-shelf tool where you can, simply because it less work, not because it is extremely hard. – Ira Baxter Mar 31 '12 at 17:13
  • Thank you! I will go and look at both of them, but can you pick one of them? :P Considering my sample code, and that I would release my code under GPL, smaller footprint etc. Hehe, I would definitely make a parser, but probably some other day. – Vinayak Garg Mar 31 '12 at 17:15
  • @VinayakGarg- I don't have much experience with either, but ANTLR has a broad community base. It might be worth exploring. – templatetypedef Mar 31 '12 at 17:20
2

For simple languages (this is a judgement call, and if you are inexperienced you may not be able to make that call correctly), one can often write a recursive descent parser by hand that does well enough. The good news is that coding a recursive descent parser is pretty straightforward.

If you aren't sure, use overkill in the form of the strongest parser generator you can get.

Community
  • 1
  • 1
Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
  • Your link was very informative, Thanks. And then the metacompiler, and then the website. It all seems great, I would love to do it, in next semester probably. But since currently I am short on time and experience, i will go for overkill! – Vinayak Garg Mar 31 '12 at 17:33
1

in simple cases writing manually a parser makes sense.

However, using StringTokenizer is a indicator of doing it wrong, because a StringTokenizer IS already a SIMPLE parser.

a parser usually reads a char and changes its state depending on the value of that char.

Just a simple parser, a "b" make following char "uppercase", e to lowercase. "." stops

 String input = "aDDbcDDeaaef.";

 int pos = 0;

 int state = 0;  
 while (pos < input.length()) {
    char z = input.charAt (pos);
    if (z == '.') break;
    switch (z) {
    case 'b': state = 1; break;
    case 'e': state = 0; break;
    default:
      if (state == 0) {
        System.out.print(Char.toLowerCase(z));
      } else {
        System.out.print(Char.toUpperCase(z));
      }
    }
    pos ++;
 }
stefan bachert
  • 9,413
  • 4
  • 33
  • 40
  • Well you are using the term SIMPLE parser as "a parser for a regular language" and your example shows a very simple regular language parser. In the original question, the language was considered as simple because there was no lexical ambiguity. However, the language itself is strictly context free. Hence there is no way to parse it using a simple finite state automaton as the one you implemented here. – dodecaplex Mar 11 '15 at 14:44
  • the incrementation of 'pos' is missing before the end of the while loop :) – loukios May 27 '17 at 12:01