Premise: I think a much better way to learn about the concepts you mention is to put the available tools to work. You could start with the classics, flex/bison, or, better, with a more recent free tool such as GOLD.
Today lexers and parsers are rarely implemented by hand, because generators handle the complexity of the underlying state machines. But for learning purposes, as advocated for instance by Niklaus Wirth, the 'father' of Pascal, or to 'bootstrap' the toolchain, or (in the past) also for efficiency reasons, the lexer and parser were sometimes written by hand. A hand-written lexer usually takes the form of a few big switch statements, and the one-character lookahead can be implemented naively with ungetc():
#include <stdio.h>

typedef enum { Num, Sym, End } token;  /* End added for end-of-input */

token getnum(void) {
    int c;                       /* int, so EOF can be told apart from chars */
    for (;;)
        switch (c = getchar()) {
        case '0':
        /* ... cases '1' through '8' ... */
        case '9':
            break;               /* still inside the number: keep scanning */
        default:
            ungetc(c, stdin);    /* push the lookahead character back */
            return Num;
        }
}

token getsym(void) {
    int c;
    for (;;)
        switch (c = getchar()) {
        case '0':
        /* ... */
        case '9':
        case 'A':
        /* ... */
        case 'Z':
        case 'a':
        /* ... */
        case 'z':
        case '_':
            break;               /* still inside the identifier */
        default:
            ungetc(c, stdin);
            return Sym;
        }
}

token lex(void) {
    int c;
    switch (c = getchar()) {
    case '0':
    /* ... */
    case '9':
        return getnum();
    case 'A':
    /* ... */
    case 'Z':
    case 'a':
    /* ... */
    case 'z':
    case '_':
        return getsym();
    default:
        return End;              /* EOF or anything unrecognized */
    }
}
The simplest parser is the top-down recursive descent parser, which requires your grammar to be LL(1). Notably, the original Pascal parser was of this kind (of course, Wirth designed the language accordingly). We need one function per nonterminal and a lookahead of one token. The parser becomes a set of mutually recursive procedures (entirely fictitious pseudocode here):
token lex(void);       /* the lexer above */
void syntaxerr(void);  /* reports the error and aborts; not shown */

void A(token t);
void B(token t);

void B(token t)
{
    if (t == Sym)
    {
        t = lex();
        B(t);
    }
    else if (t == Num)
    {
        t = lex();
        A(t);
    }
    else
        syntaxerr();
}

void A(token t)
{
    if (t == Num)
    {
        t = lex();
        B(t);
    }
    else
        syntaxerr();
}

int main(void)
{
    A(lex());          /* A is the start symbol */
    return 0;
}
We can always rewrite EBNF into BNF by introducing auxiliary ("service") nonterminals: for example, the EBNF rule A ::= B { C } becomes A ::= B A' with A' ::= C A' | ε.