5

I have had some ideas for a new programming language floating around in my head, so I thought I'd take a shot at implementing it. A friend suggested I try using Treetop (the Ruby gem) to create a parser. Treetop's documentation is sparse, and I've never done this sort of thing before.

My parser is acting like it has an infinite loop in it, but with no stack traces; it is proving difficult to track down. Can somebody point me in the direction of an entry-level parsing/AST guide? I really need something that list rules, common usage etc for using tools like Treetop. My parser grammer is on GitHub, in case someone wishes to help me improve it.

class {
  initialize = lambda (name) {
    receiver.name = name
  }

  greet = lambda {
    IO.puts("Hello, #{receiver.name}!")
  }
}.new(:World).greet()
ravinggenius
  • 816
  • 1
  • 6
  • 14

2 Answers2

8

I asked treetop to compile your language into an .rb file. That gave me something to dig into:

$ tt -o /tmp/rip.rb /tmp/rip.treetop

Then I used this little stub to recreate the loop:

require 'treetop'
load '/tmp/rip.rb'
RipParser.new.parse('')

This hangs. Now, isn't that interesting! An empty string reproduces the behavior just as well as the dozen-or-so-line example in your question.

To find out where it's hanging, I used an Emacs keyboard macro to edit rip.rb, adding a debug statement to the entry of each method. For example:

def _nt_root
  p [__LINE__, '_nt_root'] #DEBUG
  start_index = index

Now we can see the scope of the loop:

[16, "root"]
[21, "_nt_root"]
[57, "_nt_statement"]
...
[3293, "_nt_eol"]
[3335, "_nt_semicolon"]
[3204, "_nt_comment"]
[57, "_nt_statement"]
[57, "_nt_statement"]
[57, "_nt_statement"]
...

Further debugging from there reveals that an integer is allowed to be an empty string:

rule integer
   digit*
end

This indirectly allows a statement to be an empty string, and the top-level rule statement* to forever consume empty statements. Changing * to + fixes the loop, but reveals another problem:

/tmp/rip.rb:777:in `_nt_object': stack level too deep (SystemStackError)
        from /tmp/rip.rb:757:in `_nt_compound_object'
        from /tmp/rip.rb:1726:in `_nt_range'
        from /tmp/rip.rb:1671:in `_nt_special_literals'
        from /tmp/rip.rb:825:in `_nt_literal_object'
        from /tmp/rip.rb:787:in `_nt_object'
        from /tmp/rip.rb:757:in `_nt_compound_object'
        from /tmp/rip.rb:1726:in `_nt_range'
        from /tmp/rip.rb:1671:in `_nt_special_literals'
         ... 3283 levels...

Range is left-recursing, indirectly, via special_literals, literal_object, object, and compound_object. Treetop, when faced with left recursion, eats stack until it pukes. I don't have a quick fix for that problem, but at least you've got a stack trace to go from now.

Also, this is not your immediate problem, but the definition of digit is odd: It can either one digit, or multiple. This causes digit* or digit+ to allow the (presumably) illegal integer 1________2.

Wayne Conrad
  • 103,207
  • 26
  • 155
  • 191
  • Wow, thank you so much! I updated the grammar concerning numbers, but I'm not sure about the left-recursion. I'll have to educate myself a bit more before proceeding. – ravinggenius May 24 '11 at 15:13
  • @Raving G., I'm glad this is useful to you, and sorry I can't give you more concrete advice. Everything I know about left-recursion and Treetop is here: http://treetop.rubyforge.org/pitfalls_and_advanced_techniques.html. A new question, "how to deal with treetop left-recursion," might be good. – Wayne Conrad May 24 '11 at 19:01
  • What command did you run to get your stack trace? I've been trying everything I can think of, but I cannot get a trace longer than one. – ravinggenius May 24 '11 at 19:28
  • @Raving G., For the "stack level too deep trace", I used the grammar you linked to (but I first changed one instance of `digit*` to `digit+`, as noted), used the `tt` command to compile your grammar, and then ran the three-line stub that starts `require 'treetop'`. That stub fed the parser an empty input string. – Wayne Conrad May 24 '11 at 19:44
1

I really enjoyed Language Implementation Patterns by Parr; since Parr created the ANTLR parser generator, it's the tool he uses throughout the book, but it should be simple enough to learn from it all the same.

What I really liked about it was the way each example grew upon the previous one; he doesn't start out with a gigantic AST-capable parser, instead he slowly introduces problems that need more and more 'backend smarts' to do the job, so the book scales well along with the language that needs parsing.

What I wish it covered in a little more depth is the types of languages that one can write and give advice on Do's and Do Not Do's when designing languages. I've seen some languages that are a huge pain to parse and I'd have liked to know more about the design decisions that could have been made differently.

sarnold
  • 102,305
  • 22
  • 181
  • 238
  • I watched the first five or six tutorial videos, but so far I haven't been able to get a working setup to follow along with. You are correct that he doesn't start off too fast. I'll give ANTLR another try. – ravinggenius May 24 '11 at 15:16
  • 1
    @Raving, well, I liked the look of your `treetop` grammar, and if you wanted to program in Ruby, then switching to ANTLR won't help that much. :) I just thought his book did an excellent job discussing how to build good parsers, and ANTLR was the tool he choose to use. – sarnold May 24 '11 at 20:44