Writing an extremely simple parser

Question

I'm writing a very basic web server that has to support an extremely limited special server side scripting language. Basically all I need to support is "echo", addition/subtraction/multiplication (no division) with only 2 operands, a simple "date()" function that outputs the date and the use of the "&" operator to concatenate strings.

An example could be:

echo "Here is the date: " & date();
echo "9 x 15 = : & 9*15;

I've gone through and created the code necessary to generate tokens, but I'm not sure I'm using the right tokens.

I created tokens for the following:

ECHO - The echo command
WHITESPACE - Any whitespace
STRING - A string inside quotations
DATE - The date() function
CONCAT - the & operator for concatenation
MATH - Any instance of binary operation (5+4, 9*2, 8-2, etc)
TERM - The terminal character (;)

The MATH one I am particularly unsure about. Typically I see people create a token specifically for integers and then for each operator as well, but since I ONLY want to allow binary operations, I thought it made sense to group it into one token. If I were to do everything separately, I would have to do some extra work to ensure that I never accepted "5+4+1".

So question 1 is am I on the right track with which tokens to use?

My next question is what to I do with these tokens next to ensure correct syntax? The approach that I had thought of was to basically say, "Okay I know I have this token, here is a list of tokens that are allowed to come next based on the current token. Is the next token in the list?"

Based on that, I made a list of all of my tokens as well as what tokens are valid to appear directly after them (didn't include whitespace for simplicity).

ECHO        ->      STRING|MATH|DATE
STRING      ->      TERM|CONCAT
MATH        ->      TERM|CONCAT
DATE        ->      TERM|CONCAT
CONCAT      ->      STRING|MATH|DATE

The problem is I'm not sure at all how to best implement this. Really I need to keep track of whitespace as well to make sure there are spaces between the tokens. But that means I have to look ahead two tokens at a time which is getting even more intimidating. I also am not sure how to manage the "valid next tokens" stuff without just some disgusting section of if blocks. Should I be checking for valid syntax before trying to actually execute the script, or should I do it all at once and just throw an error when I reach an unexpected token? In this simple example, everything will always work just fine parsing left to right, there's no real precedence rules (except the MATH thing, but that's part of why I combined it into one token even though it feels wrong.) Even so, I wouldn't mind designing a more scalable and elegant solution.

In my research about writing parsers, I see a lot of references to creating "accept()" and "expect()" functions but I can't find any clear description of what they are supposed to do or how they are supposed to work.

I guess I'm just not sure how to implement this, and then how to actually come up with a resulting string at the end of the day.

Am I heading in the right direction and does anybody know of a resource that might help me understand how to best implement something simple like this? I am required to do it by hand and cannot use a tool like ANTLR.

Thanks in advance for any help.

You are in luck my friend, someone already did the hard part. http://irony.codeplex.com/ — asawyer, Nov 09 '12 at 15:35
You may also use Javascript http://stackoverflow.com/questions/12118077/using-javascript-for-custom-purposes — L.B, Nov 09 '12 at 15:36
@asawyer, I think you missed the part *"I am required to do it by hand and cannot use a tool like ANTLR"*, so Irony is most probably also not allowed... — Bart Kiers, Nov 09 '12 at 15:36
@BartKiers Maybe, but since it generates a normal c# assembly I thought it might be workable. — asawyer, Nov 09 '12 at 15:38
Is this homework? It's fine if it is, but you should use the [tag:homework] tag. — Justin Morgan - On strike, Nov 09 '12 at 16:25
Yeah it's part of a project in school. I just went to add the tag but it says the tag is obsolete and should not be used so I haven't added it yet. If it's still in use though I will certainly add it! — ARW, Nov 09 '12 at 16:28
See my answer on how to build recursive descent parsers. They're actually pretty easy to build. See http://stackoverflow.com/questions/2245962/is-there-an-alternative-for-flex-bison-that-is-usable-on-8-bit-embedded-systems/2336769#2336769 — Ira Baxter, Nov 09 '12 at 17:13
Why are you going to make it complicated? Use seperate tokens, keep it simple and keep it smart, easy as pie. — Security Hound, Nov 09 '12 at 17:25

score 2 · Accepted Answer · answered Nov 09 '12 at 16:15

The first thing that you need to do is to discard all the white-spaces (except for the ones in strings). This way, when you add tokens to the list of tokens, you are sure that the list contains only valid tokens. For example, consider this statement:

echo "Here is the date: " & date();

I will start tokenizing and first separate echo based on the white-space (yes, white-space is needed here to separate it but isn't useful after that). The tokenizer then encounters a double quote and continues reading everything until the closing double quote is found. Similarly, I create separate tokens for &, date and ().

My token list now contains the following tokens:

echo
"Here is the date: "
&
date
()

Now, in the parsing stage, we read these tokens. The parser loops through every token in the token list. It reads echo and checks if it is valid (based on the rules / functions you have for the language). It advances to the next token and sees if it is either of the date, string or math. Similarly, it checks the rest of the tokens. If at any point, a token is not supposed to be there, you can throw an error indicating syntax error or something.

For the math statement tokenization, only combine the expression that is contained in a bracket and rest of operands and operators separately. For example: 9/3 + (7-3+1) would have the tokens 9, /, 3, +, and (7-3+1). As every token has its own priority (that you define in the token struct), you can start evaluating from the highest priority token down to the lowest token priority. This way you can have prioritized expressions. If you still have confusion, let me know. I'll write you some example code.

Thank you very much this is definitely helpful. I am going to take a stab at it this afternoon and if I'm still having problems I'll take you up on the sample code offer. Thanks again! — ARW, Nov 09 '12 at 16:19

prprcupofcoffee · Answer 2 · 2012-11-09T16:30:43.017

expect is what your parser does to get the next token, and fails if the token isn't a proper following token. To begin with, your parser expects ECHO or WHITESPACE. Those are the only valid starting terms. Having seen "ECHO", your parser expects one of WHITESPACE|STRING|MATH|DATE; anything else is an error. And so on.

accept is when your parser has seen a complete "statement" - ECHO, followed by a valid sequence of tokens, followed by TERM. Your parser now has enough information to process your ECHO command.

Oh, and hand-written parsers (especially simple ones) are very often disgusting collections of if blocks (or moral equivalents like switch statements) :) Further up the line of elegant-ness would be some kind of state machine, and further up from that is a grammar generator like yacc or GOLD Parser Generator (which in turn churn out ugly if, switch, and state machines for you).

EDIT to provide more details.

To help sort out responsibilities, create a "lexer" whose job is to read the input and produce tokens. This involves deciding what tokens look like. An easy token is the word "echo". A less easy token is a math operation; the token would consist of one or more digits, an operator, and one or more digits, with no whitespace between. The lexer would take care of skipping whitespace, as well as understanding a quoted string and the characters that form the date() function. The lexer would return two things - the type of token read and the value of the token (e.g., "MATH" and "9*15").

With a lexer in hand to read your input, the parser consumes the tokens and ensures they're in a proper order. First you have to see the ECHO token. If not, fail with an error message. After that, you have to see STRING, DATE, or MATH. If not, fail with an error message. After that, you loop, watching for either TERM, or else CONCAT followed by another STRING, DATE, or MATH. If you see TERM, break the loop. If you see neither TERM nor CONCAT, fail with an error message.

You can process the ECHO command as you're parsing, since it's a simple grammar. Each time you find a STRING, DATE or MATH, evaluate it and concatenate it to what you already have. When you find TERM, exit the function and return the built-up string.

Questions? Comments? Omelets? :)

Thank you David that is definitely helpful. What I'm still struggling with is how this loop that iterates across the tokens works. What would an actual implementation of "expect" look like for example? After WHITESPACE, what I'm expecting really depends on what the token before WHITESPACE was, which is where it gets confusing. How is "expect" actually used, is it a function that takes a list of possible tokens as a parameter and returns true/false? Does it also change my "current" token to be whatever "expect" just read, then I call expect again? Just confused about the implementation :/ — ARW, Nov 09 '12 at 15:59

Writing an extremely simple parser

2 Answers2