Lexer and parser in C++ from EBNF

Question

I need to write a lexer and a parser for a given grammar (I need to handcraft it not with generators). I have done a lot of research but I still can't figure out how to code it.

For example I have (grammar in EBNF):

<Letter> ::= [A-Za-z]

<IntegerLiteral> ::=<Digit> { <Digit> }

Does this need to be defined in the lexer or in the parser? And how?

I know that a lexer should read a file character by character and output tokens then these tokens are passed to the parser to create the parse tree however I am getting stuck in the coding.

"I need to handcraft it not with generators": does that preclude you from using boost spirit? (www.boost.org). — Bathsheba, Apr 17 '14 at 09:36
_'is it a generator?'_ Not in the sense that anything else than the c++ compiler needs to be used. — πάντα ῥεῖ, Apr 17 '14 at 10:18
if you are interested in that topic looking at other compilers handcrafting their stuff I guess the Lexer Classes of clang are good sources of inspiration. — Alexander Oh, Apr 17 '14 at 10:38
See my SO answer on how to write a recursive descent parser, using the EBNF as a direct guide for telling you what to code: http://stackoverflow.com/questions/2245962/is-there-an-alternative-for-flex-bison-that-is-usable-on-8-bit-embedded-systems/2336769#2336769 — Ira Baxter, Jun 25 '16 at 10:57

score 2 · Answer 1 · answered Apr 17 '14 at 10:30

What you show us looks like it defines token types. So it goes in the lexer.

The trick in writing a lexer is simply to take your input text (which is simply a long stream of individual characters) and look at them, one by one. Every time you look at a character, classify it according to the EBNF above (i.e. is it a Letter or an IntegerLiteral) then generate the appropriate token.

Now your grammar above sounds like a pretty pointless one (it generates single-character and single-digit tokens) So my guess is that you have more rules like this one that use these rules to make the definition more readable. So implement those more complex rules. Write a function for detecting whether a character matches one of the sub-rules.

Whenever you find that the current character doesn't match the previous character's type, finish the current one and start a new one.

That's pretty much all there is to it. You just need a bunch of booleans to keep track of the types.

is there a site which demostrates how this can be done? And why booleans to keep track of the types? Ans yes I do have more complex rules I just wrote the 2 simple ones. Also how can you classify a character to the above EBNF? — user3505334, Apr 17 '14 at 14:45
Because booleans are the simplest type one uses to remember something that's on or off (i.e. am I in something that may be an integer or a double, was I in an integer but just encountered a period so it can now only be a fractional number, etc.). You could also implement it as a full-blown state machine, but whenever I mention the word "state machine" people run scared, so I thought I'd leave that out of the explanation. — uliwitness, Apr 22 '14 at 17:21

Lexer and parser in C++ from EBNF

1 Answers1