5

Are there any common solutions how to use incomplete grammars? In my case I just want to detect methods in Delphi (Pascal)-files, that means procedures and functions. The following first attempt is working

    methods
      : ( procedure | function | . )+
      ;

but is that a solution at all? Are there any better solutions? Is it possible to stop parsing with an action (e. g. after detecting implementation). Does it make sense to use a preprocessor? And when yes - how?

ANTLRStarter
  • 309
  • 1
  • 4
  • 16
  • I'm not sure it is possible with antlr and alike, but it is rather trivial with the PEG-based parsers - just define a rule like `(!methods .)+ / methods`, and it will parse the entire stream, detecting everything that look like `methods`. Probably you'd want to handle comments and string literals here as well. – SK-logic Aug 26 '11 at 11:33
  • Do you want everything in between `Begin` and `End;` of the procedures/functions as well? Or are you only interested in the names of them? – Bart Kiers Aug 26 '11 at 11:42
  • @Bart: First only the names... – ANTLRStarter Aug 26 '11 at 12:35
  • @SK-logic: Many thanks for your answer, will take also a look at PEG-based parsers. – ANTLRStarter Aug 26 '11 at 12:42
  • I'm sorry, the rule should be following: `((!methods .)/methods)+`. Here, `.` parses any character. – SK-logic Aug 26 '11 at 12:45
  • @Bart: And now ... the next step - simplest approach for getting the body of procedures/functions, or as you asked "Do you want everything in between 'Begin' and 'End;' of the procedures/functions as well?" Many thanks in advance. – ANTLRStarter Oct 02 '11 at 08:35
  • @ANTLRStarter, to parse the method body is pretty much the same as parsing the entire file. Note that there are 2 Pascal grammars on the ANTLR wiki: http://www.antlr.org/grammar/list (never tried them though...). – Bart Kiers Oct 02 '11 at 11:20

2 Answers2

4

If you're only looking for names, then something as simple as this:

grammar PascalFuncProc;

parse
  :  (Procedure | Function)* EOF
  ;

Procedure
  :  'procedure' Spaces Identifier
  ;

Function
  :  'function' Spaces Identifier
  ;

Ignore
  :  (StrLiteral | Comment | .) {skip();}
  ;

fragment Spaces     : (' ' | '\t' | '\r' | '\n')+;
fragment Identifier : ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*;
fragment StrLiteral : '\'' ~'\''* '\'';
fragment Comment    : '{' ~'}'* '}';

will do the trick. Note that I am not very familiar with Delhpi/Pascal, so I am surely goofing up StrLiterals and/or Comments, but that'll be easily fixed.

The lexer generated from the grammar above will only produce two type of tokens (Procedures and Functions), the rest of the input (string literals, comments or if nothing is matched, a single character: the .) is being discarded from the lexer immediately (the skip() method).

For input like this:

some valid source
{ 
  function NotAFunction ...
}

procedure Proc
Begin
  ...
End;

procedure Func
Begin
  s = 'function NotAFunction!!!'
End;

the following parse tree is created:

enter image description here

Bart Kiers
  • 166,582
  • 36
  • 299
  • 288
  • Bart's great with ANTLR stuff, thus his instant island grammar. But ... it has issues (not Bart's fault). If your langauge allows text strings or comments, this will erroneously pick up **/* procedure Proc is foobarred */** as a procedure declaration. This is one of the issues with island grammars: they have to be accurate enough for your purposes. Maybe you don't care, but be sure you don't care or you'll be badly surprised. – Ira Baxter Aug 26 '11 at 17:18
  • 1
    @Ira, err, I did define string literal- and comment-rules in the lexer that will be skipped (see the rules `StrLiteral` and `Comment`). Or did I misunderstand? – Bart Kiers Aug 26 '11 at 17:39
  • @Bart: No, I didn't see them in the code in your answer; its amazing what you don't see if you're sure you know what something says. I guess I'm just gonna have to learn to read more carefully :-{ Good job, Bart (gets cookie on comment). The fact that you added them makes my point, too :-} – Ira Baxter Aug 26 '11 at 17:40
  • @Ira, ah, good. I'm always on my guard when someone like you (meaning: someone far more experienced w.r.t. parsing) comments on one of my answers. Especially if that someone recently promised [to never doubt me](http://stackoverflow.com/questions/6918555/antlr-how-to-parse-a-region-within-matching-brackets-with-a-lexer?answertab=votes#tab-top)... ;) – Bart Kiers Aug 26 '11 at 18:22
4

What you asking about are called island grammars. The notion is that you define a parser for the part of the language you care about (the "island") with all the classic tokenization needed for that part, and that you define an extremely sloppy parser to skip the rest (the "ocean" in which the island is embedded). One common trick to doing this is to define correspondingly sloppy lexers, that pick up vast amounts of stuff (to skip past HTML to embedded code, you can try to skip past anything that doesn't look like a script tag in the lexer, for example).

The ANTLR site even discusses some related issues but notably says there are examples included with ANTLR. I have no experience with ANTLR so I don't know how useful this specific information is.

Having built many tools that use parsers to analyze/transform code (check my bio) I'm a bit of a pessimist about the general utility of island grammmars. Unless your goal is to do something pretty trivial with the parsed-island, you will need to collect the meaning of all identifiers it uses directly or indirectly... and most of them are unfortunately for you defined in the ocean. So IMHO you pretty much have to parse the ocean too to get past trivial tasks. You'll have other troubles, too, making sure you really skip the island stuff; this pretty much means your ocean lexer has know about whitespace, comment, and all the picky syntax of character strings (this is harder than it looks with modern languages) so that these get properly skipped over. YMMV.

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341