3

I am trying to learn the concepts and how to create a lexical analyser and parser in C from BNF notation, not EBNF. I would like to learn it in the language of C.

Can anyone explain to me the parts of the BNF i use to put in the lexical analyser and the parser in C and where to put them please? Such as an example used as well maybe? I have found that in the parser, you put terminals, non terminals, tokens, types, etc...

Sorry if thats unclear or anything, my head is all over the place in this

Thank you

ps. BNF i have

<for_statement> ::= FOR <identifier> 
IS <expression> BY <expression> TO <expression> DO <statement_list> ENDFOR

Lexical analyser code snippit

ENDP                   printf("keyword: ENDP\n");
DECLARATIONS           printf("keyword: DECLARATIONS\n");
CODE                   printf("keyword: CODE\n");
"OF TYPE"              printf("keyword: OF TYPE\n");
Tanya Tazzy Hegarty
  • 111
  • 3
  • 6
  • 15
  • 2
    The question is a bit vague. Read the [gnu-lex and Bison manual pages](http://dinosaur.compilertools.net/) to get started. Feel free to post a specific question when you run into problems. – Bart Kiers Nov 19 '11 at 11:20

1 Answers1

2

premise: I think that a much better way to learn about the concepts you speak of could be to put the available tools to work. You could start with the basics, like flex/bison, or -better- some recent free tool, like GOLD.

Today lexers and parsers aren't implemented by hand, due to the complexity of the underlying state machines. But for learning purpose, as advocated for instance by Niklaus Wirth, the 'father' of Pascal, or to 'bootstrap' the SW chain tools, or (in the past) also for efficiency reasons, the lexer/parser was sometime implemented by hand. The lexer usually take the form of many big switch(s), and the lookahead can be naively implemented using ungetc():

#include <stdio.h>
enum token {Num, Sym};

token getnum() {
 char c;
 for ( ; ; )
  switch (c = getc()) {
  case '0':
  ...
  case '9':
   break;
  default:
   ungetc(c);
   return Num;
  }
}
token getsym() {
 char c;
 for ( ; ; )
  switch (c = getc()) {
  case '0':
  ...
  case '9':
   break;
  default:
   ungetc(c);
   return Sym;
 }
}
token lex() {
 char c;
 switch (c = getc()) {
 case '0':
 ...
 case '9':
  return getnum();
 case 'A':
 ...
 case 'Z':
 case 'a':
 ...
 case 'z':
 case '_':
  return getsym();
 }

The simplest parser it's the top-down recursive, that requires your grammar being LL(1). Notably the Pascal parser was of this kind (of course, Wirth designed appropriately the language). We need a function for each nonterminal, and a lookahead of 1 token. The parser becomes a set of mutually recursive procedures (totally fictious pseudocode here):

void A(token t);
void B(token t);

void B(token t)
{
 if (t == Sym)
 {
  t = lex();
  B(t);
 }
 else if (t == Num)
 {
  t = lex();
  A(t);
 }
 else
  syntaxerr();
}
void A(token t)
{
 if (t == Num)
 {
  t = lex();
  B(t);
 }
 else
  syntaxerr();
}

int main(int argc, char **argv)
{
 A(lex());
}

We can always rewrite EBNF in BFN, introducing service non terminals.

CapelliC
  • 59,646
  • 5
  • 47
  • 90