I'm doing an assignment for school right now that focuses on building a tokenizer and parser for simple sets of instructions. This instruction set uses EBNF grammar which we have to output to the user. As a simple example, imagine this instruction set code looks like this:
set 0, 2 * (4 + 20)
halt
And the EBNF grammar is given like this:
<Program> -> <Statement> { <Statement> }
<Statement> -> <Set> | 'halt'
<Set> -> set ('write'|<Expr>), ('read'|<Expr>)
<Expr> -> <Term> {(+|-) <Term>}
<Term> -> <Factor> {(*|/|%) <Factor>}
<Factor> -> <Number> | 'D['<Expr>']' | '('<Expr>')'
<Number> -> 0 | (1...9){0...9}
In this case, <...>
is what I output to the user, {...}
means choose 0 or more, (...)
means choose 1 or more and '...' is a string literal.
So in this case, when running the program, the correct output should look like this:
Program
Statement
Set // set
Expr // 0
Term // 0 is a term
Factor // 0 is a factor
Number // 0 is finally a number
Expr // go to the next string, "2*(4+2)"
Term // start with the number 2
Factor // 2 is a factor
Number // 2 is a number
Factor // after 2, we have a * so after that is a factor
Expr // the factor leads to a (...) which leads to an expr
Term // the expr leads to a term 4
Factor // 4 is a factor
Number // 4 is returned as a number
Term // we have a + so the next string (20) is a term
Factor // 20 is a factor
Number // return 20 as a number
Statement // halt is a statement, end
I've already built the tokenizer portion of the assignment, so running the above instruction set produces a std::vector<std::string>
that looks like this (where each new string is separated by a new line):
set
0
2*(4+20)
halt
Now the parser is where I am stuck. I first push each string into a std::queue
, examine the string, parse it, and then pop it from the queue. I already have it so that a set
statement pushes the string "Set" in another std::vector<std::string>
so I can print it out later.
I also have a function, parse_expression(std::string& str)
to parse my expressions. The real trouble I have is with how I could correctly parse the 2 * (4 + 20)
part of it.
My teacher told me to go character by character through the string, check whether it is a number (which I know how to do), or if it matches a '+', '-', '/', '*', '%' character, or if the next character is a 'D' then I should parse another expression.
I'm sort of confused on how I should attempt this though. If I go character by character, I can get the first number, go until I hit a non number character, and then just push out Term
followed by Factor
followed by Number
. But how can I then go back to that original character and keep pushing onwards towards the next ones. Almost like, how do I go back to the original level I was at so that I can correctly determine whether or not I need to create another Expr, or whether or not I should just be doing another term.
I realize this is a long, confusing question, but I would appreciate any push in the right direction, whether it be something glaringly obvious that I'm doing wrong, or whether it be how a parser actually works.