
Good evening, Stack Overflow. I'd like to develop an interpreter for expressions based on a pretty simple context-free grammar:

Grammar

Basically, the language consists of two base statements:

( SET var 25 ) // Output: var = 25
( GET ( MUL var 5 ) ) // Output: 125
( SET var2 ( MUL 30 5 ) ) // Output: var2 = 150

Now, I'm pretty sure about what I should do in order to interpret a statement: 1) lexical analysis to turn a statement into a sequence of tokens, 2) syntax analysis to build a symbol table (a HashMap with the variables and their values) and a syntax tree (to evaluate the GET statements), and 3) an in-order visit of the tree to compute the results I want.
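For illustration only, the output of step 1 could be modelled roughly like this (TokenType and Token are placeholder names for this sketch, not part of any existing code):

// Hypothetical token model for the lexer output.
enum TokenType { OPEN_PAR, CLOSE_PAR, SET, GET, MUL, NUMBER, VARIABLE }

class Token {
    final TokenType type;
    final String text;   // the lexeme, e.g. "var" or "25"

    Token(TokenType type, String text) {
        this.type = type;
        this.text = text;
    }

    @Override
    public String toString() {
        return type + "(" + text + ")";
    }
}

// "( SET var 25 )" would then lex to the sequence:
// OPEN_PAR, SET, VARIABLE(var), NUMBER(25), CLOSE_PAR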

I'd like some advice on how to read the source file for parsing. Considering that the parser should ignore any whitespace, tabs or newlines, is it possible to use a Java Pattern to match a whole statement to analyze? And is there a good way to read an oddly formatted (and possibly more complex) statement like this

(
  SET var

 25
 )

without confusing the parser with the opening and closing parentheses?

For example

Scanner scan;           // scanner reading the source file
String pattern = "..."; // ideal pattern I've found to represent an expression
while (scan.hasNext(pattern))
    Interpreter.computeStatement(scan.next(pattern));

Would this be a viable option for this problem?

fedexist
  • After you read all the file content and store it in a string, then `for (String part : fileContent.split("\\s+")) { /* ... */ }`. Now `part` stores meaningful characters. Notice that you still have to perform lexical analysis on `part`, but it's easy. – Jason Hu Dec 09 '14 at 19:33
  • The default delimiter for `Scanner` is to match whitespace, so you're all set to go without adding any special regex at all. – markspace Dec 09 '14 at 19:40
  • @markspace I'm aware that Scanner uses whitespace as its delimiter; I was wondering how to make the Scanner recognise only one statement at a time. Using Interpreter.computeStatement(scan.next()) would only get one token at a time (a parenthesis, a keyword, a number or a variable). – fedexist Dec 09 '14 at 19:55
  • The normal method is to lex (tokenize) the input file first, which is what I'm doing with Scanner. Then build a [parse tree](http://en.wikipedia.org/wiki/Parse_tree) as you lex, which is what I assumed you wanted to do. Then walk the parse tree to interpret or compile your code. – markspace Dec 09 '14 at 20:18
  • Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers. – Ira Baxter Dec 09 '14 at 21:48
  • Maybe this will help: http://stackoverflow.com/a/2336769/120163 – Ira Baxter Dec 09 '14 at 22:03
  • @markspace: "build a parse tree as you lex..." This doesn't make any sense. – Ira Baxter Dec 09 '14 at 22:11
  • OK, I got it: I'm forced to parse per character (since scan.next() may give invalid tokens such as "(GET", which I would be forced to parse again to get valid tokens). Thanks for the help. – fedexist Dec 11 '14 at 11:24
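One simple way around the "(GET" problem mentioned in the last comment, borrowed from typical S-expression readers (not from any code in this thread), is to pad every parenthesis with spaces before splitting on whitespace. A minimal sketch, assuming the whole source has already been read into a String:

import java.util.Arrays;
import java.util.List;

public class Tokenizer {

    // Surround every parenthesis with spaces, then split on any whitespace,
    // so a parenthesis can never be glued to a keyword or a number.
    static List<String> tokenize(String source) {
        String spaced = source.replace("(", " ( ").replace(")", " ) ");
        return Arrays.asList(spaced.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // The oddly formatted statement from the question.
        String src = "(\n  SET var\n\n 25\n )";
        System.out.println(tokenize(src));   // prints: [(, SET, var, 25, )]
    }
}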

2 Answers


Solution proposed by Ira Baxter:

Your title is extremely confused. You appear to want to parse what are commonly called "S-expressions" in the LISP world; this takes a (simple but) context-free grammar. You cannot parse such expressions with regexps. Time to learn about real parsers.


Maybe this will help: stackoverflow.com/a/2336769/120163
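To illustrate the S-expression idea from the linked answer, a recursive-descent reader can be sketched in a few lines of Java. The names below (SExprReader, parse) are invented for this sketch, and it assumes a token list like the one produced by the tokenizer sketch above:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class SExprReader {

    // Parses one expression: an atom is returned as a String,
    // a parenthesised expression as a nested List<Object>.
    static Object parse(Deque<String> tokens) {
        String tok = tokens.pollFirst();
        if (tok == null) throw new IllegalArgumentException("unexpected end of input");
        if (tok.equals("(")) {
            List<Object> list = new ArrayList<>();
            while (!")".equals(tokens.peekFirst())) {
                if (tokens.peekFirst() == null)
                    throw new IllegalArgumentException("missing ')'");
                list.add(parse(tokens));   // recurse into nested expressions
            }
            tokens.pollFirst();            // consume the ")"
            return list;
        }
        if (tok.equals(")")) throw new IllegalArgumentException("unexpected ')'");
        return tok;                        // atom: keyword, variable or number
    }

    public static void main(String[] args) {
        Deque<String> tokens = new ArrayDeque<>(
                Arrays.asList("(", "SET", "var2", "(", "MUL", "30", "5", ")", ")"));
        System.out.println(parse(tokens)); // prints: [SET, var2, [MUL, 30, 5]]
    }
}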

Stephan

In the end, I understood, thanks to Ira Baxter, that this context-free grammar can't be parsed with regexes, and I used the concept of S-expressions to build the interpreter, whose source code you can find here. If you have any questions about it (mainly because the comments aren't translated into English, even though I think the code is pretty clear), just message me or comment here.

Basically what I do is:

  • Read the input character by character and tokenize it (e.g. '(' becomes OPEN_PAR, "SET" becomes STATEMENT_SET, and a single letter like 'b' is tokenized as a VARIABLE)
  • Then, I use the resulting token list to do the syntactic analysis, which checks the patterns occurring inside the token list against the grammar
  • If there's an expression inside the statement, I check recursively for any expression nested inside another expression, throwing an exception and skipping to the next valid statement if needed
  • After analysing each statement, I evaluate it as required by the specification (see the sketch after this list)
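As a rough sketch of that last evaluation step (not the linked source code itself, just an illustration reusing the nested-list shape from the reader sketch above), the evaluation with a HashMap symbol table could look like this:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Evaluator {

    // Symbol table: variable name -> value.
    private final Map<String, Integer> symbols = new HashMap<>();

    // Evaluates an expression: a number literal, a known variable, or (MUL a b).
    int eval(Object expr) {
        if (expr instanceof String) {
            String atom = (String) expr;
            return symbols.containsKey(atom) ? symbols.get(atom) : Integer.parseInt(atom);
        }
        List<?> list = (List<?>) expr;
        if ("MUL".equals(list.get(0))) {
            return eval(list.get(1)) * eval(list.get(2));
        }
        throw new IllegalArgumentException("unknown expression: " + expr);
    }

    // Executes a top-level statement: (SET name expr) or (GET expr).
    void execute(List<?> stmt) {
        if ("SET".equals(stmt.get(0))) {
            String name = (String) stmt.get(1);
            int value = eval(stmt.get(2));
            symbols.put(name, value);
            System.out.println(name + " = " + value);   // e.g. var = 25
        } else if ("GET".equals(stmt.get(0))) {
            System.out.println(eval(stmt.get(1)));      // e.g. 125
        } else {
            throw new IllegalArgumentException("unknown statement: " + stmt);
        }
    }
}

Running the three example statements from the question through execute, in order, would print var = 25, then 125, then var2 = 150.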
fedexist