
Good evening, Stack Overflow. I'd like to develop an interpreter for expressions based on a pretty simple context-free grammar:

Grammar

Basically, the language consists of two base statements:

( SET var 25 ) // Output: var = 25
( GET ( MUL var 5 ) ) // Output: 125
( SET var2 ( MUL 30 5 ) ) // Output: var2 = 150

Now, I'm pretty sure about what I should do in order to interpret a statement: 1) lexical analysis to turn a statement into a sequence of tokens, 2) syntax analysis to build a symbol table (a HashMap with the variables and their values) and a syntax tree (to evaluate the GET statements), and 3) an in-order visit of the tree to compute the results I want.
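For illustration only, the output of step 1 could be modelled roughly like this (TokenType and Token are placeholder names for this sketch, not part of any existing code):

// Hypothetical token model for the lexer output.
enum TokenType { OPEN_PAR, CLOSE_PAR, SET, GET, MUL, NUMBER, VARIABLE }

class Token {
    final TokenType type;
    final String text;   // the lexeme, e.g. "var" or "25"

    Token(TokenType type, String text) {
        this.type = type;
        this.text = text;
    }

    @Override
    public String toString() {
        return type + "(" + text + ")";
    }
}

// "( SET var 25 )" would then lex to the sequence:
// OPEN_PAR, SET, VARIABLE(var), NUMBER(25), CLOSE_PAR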

I'd like some advice on how to read the source file for parsing. Considering that the parser should ignore any whitespace, tabs or newlines, is it possible to use a Java Pattern to match a whole statement to analyze? And is there a good way to read an oddly formatted (and possibly more complex) statement like this

(
  SET var

 25
 )

without confusing the parser with the opening and closing parentheses?

For example

Scanner scan;           // scanner reading the source file
String pattern = "..."; // ideal pattern I've found to represent an expression
while (scan.hasNext(pattern))
    Interpreter.computeStatement(scan.next(pattern));

Would this be a viable option for this problem?

fedexist
  • After you read all the file content and store it in a string, then `for (String part : fileContent.split("\\s+")) { /* ... */ }`. Now `part` stores meaningful characters. Notice that you still have to perform lexical analysis on `part`, but it's easy. – Jason Hu Dec 09 '14 at 19:33
  • The default delimiter for `Scanner` is to match whitespace, so you're all set to go without adding any special regex at all. – markspace Dec 09 '14 at 19:40
  • @markspace I'm aware that Scanner uses whitespace as its delimiter; I was wondering how to make the Scanner recognise only one statement at a time. Using Interpreter.computeStatement(scan.next()) would only get one token at a time (a parenthesis, a keyword, a number or a variable). – fedexist Dec 09 '14 at 19:55
  • The normal method is to lex (tokenize) the input file first, which is what I'm doing with Scanner. Then build a [parse tree](http://en.wikipedia.org/wiki/Parse_tree) as you lex, which is what I assumed you wanted to do. Then walk the parse tree to interpret or compile your code. – markspace Dec 09 '14 at 20:18
  • Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers. – Ira Baxter Dec 09 '14 at 21:48
  • Maybe this will help: http://stackoverflow.com/a/2336769/120163 – Ira Baxter Dec 09 '14 at 22:03
  • @markspace: "build a parse tree as you lex..." This doesn't make any sense. – Ira Baxter Dec 09 '14 at 22:11
  • OK, I got it: I'm forced to parse per character (since scan.next() may give invalid tokens such as "(GET", which I would be forced to parse again to get valid tokens). Thanks for the help. – fedexist Dec 11 '14 at 11:24
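One simple way around the "(GET" problem mentioned in the last comment, borrowed from typical S-expression readers (not from any code in this thread), is to pad every parenthesis with spaces before splitting on whitespace. A minimal sketch, assuming the whole source has already been read into a String:

import java.util.Arrays;
import java.util.List;

public class Tokenizer {

    // Surround every parenthesis with spaces, then split on any whitespace,
    // so a parenthesis can never be glued to a keyword or a number.
    static List<String> tokenize(String source) {
        String spaced = source.replace("(", " ( ").replace(")", " ) ");
        return Arrays.asList(spaced.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // The oddly formatted statement from the question.
        String src = "(\n  SET var\n\n 25\n )";
        System.out.println(tokenize(src));   // prints: [(, SET, var, 25, )]
    }
}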

2 Answers


Solution proposed by Ira Baxter:

Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers.


Maybe this will help: stackoverflow.com/a/2336769/120163
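To illustrate the S-expression idea from the linked answer, a recursive-descent reader can be sketched in a few lines of Java. The names below (SExprReader, parse) are invented for this sketch, and it assumes a token list like the one produced by the tokenizer sketch above:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class SExprReader {

    // Parses one expression: an atom is returned as a String,
    // a parenthesised expression as a nested List<Object>.
    static Object parse(Deque<String> tokens) {
        String tok = tokens.pollFirst();
        if (tok == null) throw new IllegalArgumentException("unexpected end of input");
        if (tok.equals("(")) {
            List<Object> list = new ArrayList<>();
            while (!")".equals(tokens.peekFirst())) {
                if (tokens.peekFirst() == null)
                    throw new IllegalArgumentException("missing ')'");
                list.add(parse(tokens));   // recurse into nested expressions
            }
            tokens.pollFirst();            // consume the ")"
            return list;
        }
        if (tok.equals(")")) throw new IllegalArgumentException("unexpected ')'");
        return tok;                        // atom: keyword, variable or number
    }

    public static void main(String[] args) {
        Deque<String> tokens = new ArrayDeque<>(
                Arrays.asList("(", "SET", "var2", "(", "MUL", "30", "5", ")", ")"));
        System.out.println(parse(tokens)); // prints: [SET, var2, [MUL, 30, 5]]
    }
}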

Stephan

In the end, I understood, thanks to Ira Baxter, that this context-free grammar can't be parsed with regexes, and I used the concept of S-expressions to build the interpreter, whose source code you can find here. If you have any questions about it (mainly because the comments aren't translated into English, even though I think the code is pretty clear), just message me or comment here.

Basically what I do is:

  • Read the input character by character and tokenize it (e.g. '(' becomes OPEN_PAR, "SET" becomes STATEMENT_SET, and a single letter like 'b' is tokenized as a VARIABLE)
  • Then, I use the resulting token list to do the syntactic analysis, which checks the patterns occurring inside the token list against the grammar
  • If there's an expression inside the statement, I check recursively for any expression nested inside another expression, throwing an exception and skipping to the next valid statement if needed
  • After analysing each statement, I evaluate it as required by the specification (see the sketch after this list)
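As a rough sketch of that last evaluation step (not the linked source code itself, just an illustration reusing the nested-list shape from the reader sketch above), the evaluation with a HashMap symbol table could look like this:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Evaluator {

    // Symbol table: variable name -> value.
    private final Map<String, Integer> symbols = new HashMap<>();

    // Evaluates an expression: a number literal, a known variable, or (MUL a b).
    int eval(Object expr) {
        if (expr instanceof String) {
            String atom = (String) expr;
            return symbols.containsKey(atom) ? symbols.get(atom) : Integer.parseInt(atom);
        }
        List<?> list = (List<?>) expr;
        if ("MUL".equals(list.get(0))) {
            return eval(list.get(1)) * eval(list.get(2));
        }
        throw new IllegalArgumentException("unknown expression: " + expr);
    }

    // Executes a top-level statement: (SET name expr) or (GET expr).
    void execute(List<?> stmt) {
        if ("SET".equals(stmt.get(0))) {
            String name = (String) stmt.get(1);
            int value = eval(stmt.get(2));
            symbols.put(name, value);
            System.out.println(name + " = " + value);   // e.g. var = 25
        } else if ("GET".equals(stmt.get(0))) {
            System.out.println(eval(stmt.get(1)));      // e.g. 125
        } else {
            throw new IllegalArgumentException("unknown statement: " + stmt);
        }
    }
}

Running the three example statements from the question through execute, in order, would print var = 25, then 125, then var2 = 150.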
fedexist