
I searched quite a bit and still can't seem to grasp the workflow and file arrangement of the tokens, the lexer, and the parser. I'm using plain C with Visual Studio. I've been through countless tutorials, and they seem to skip this part.

So when you write a simple function in source.txt, the lexer (in a C file) reads the source file and breaks that simple function into tokens?

In the Dr. Dobb's book, there is a file containing a list of already-defined tokens that is written into a def file. If the previous question is true, where does this predefined token list fit in?

Once the source file is read by the lexer and split into tokens, how does the parser retrieve the tokenized output the lexer is producing and then apply the grammar?

I've seen a typedef struct used for what looked like a partial breakdown of the predefined tokens, with some of the lexer mixed in. So are the tokens in one file, the lexer in another, and the parser in a third? Or are the tokens in one file, with the lexer and parser together calling that token file?

I just need clarification, a good point in the right direction, any help is much appreciated.

tk421
  • Hi, welcome to Stack Overflow. Please refer to the [ask] link for more details on how to ask a question and update your question accordingly. – Jeroen Heier Jun 24 '18 at 08:25
  • Don't read only "tutorials", read *books*, learn the *theory* (discrete math), and read some source for existing (and **simple**) compilers, translators and interpreters. And as with everything in the programming world, there's no one single true way to do things. – Some programmer dude Jun 24 '18 at 08:32

2 Answers


A simple, classic way to organize these tasks is to set up the various parts as subroutines that pass data structures to one another. Forgive my sloppy syntax.

First you need a definition of a token, the thing produced by the lexer. This is almost always a struct with an enum to indicate the token type, and a union to carry whatever value that token type may have:

 struct Token {
     enum  // distinguishes token types
          { EndOfFile, Integer, String, Float, Identifier,
            Semicolon, Colon, Plus, Minus, Times, Divide, LeftParen, RightParen,
            KeywordBegin, KeywordDeclare, KeywordEnd, ...
          } tokentype;
      union {
         long numeric_value; // holds numeric value of integer-valued tokens
         char* string_value; // holds ptr to string body of identifiers or string literals
         float float_value;  // holds numeric value of floating-point-valued tokens
            } tokenvalue;
      };

You'll want to build an abstract syntax tree (AST). For that you'll need a TreeNode type. Like tokens, these almost always have an enum to indicate the node type, a union to hold the various properties of that node type, and finally a list of pointers to children:

      struct TreeNode {
          enum // distinguishes tree node types
             { Goal, Function, StatementList, Statement, LeftHandSide, Expression,
               Add, Subtract, Times, Divide, Identifier, FunctionCall, ...
             } nodetype;
          struct TreeNode* children[4]; // a small constant here is almost always enough
          union // holds values specific to the node type, often a copy of the lexer union
             { long numeric_value; // holds:
                       // numeric value of integer-valued token
                       // index of built-in function number
                       // actual number of children if it varies
                       // ...
               char* string_value; // holds ptr to string body of identifiers or string literals
               float float_value;  // holds numeric value of floating-point-valued token
            } nodevalue;
        };

MyCompiler.c contains the following code:

     int main() {
            filehandle Handle = OpenSourceFile(&FileName);
            TreeNode* ASTrootnode = Parser(Handle);
            CodeGenerator(ASTrootnode);
            exit(0);
     }

Parser.c contains the following code:

     TreeNode* Parser(filehandle Handle) {
            <parsing machinery>
            Token nexttoken = Lexer(Handle);
            <more parsing machinery to build tree nodes>
            return toplevel_TreeNode;
     }

Lexer.c contains the following code:

    Token Lexer(filehandle Handle) {
         Token token;
         <process characters in buffer>
         if (bufferp == end_of_buffer)
            fileread(Handle, &bufferbody, bufferlength);
         <process characters in buffer>
         token.tokentype = <type of recognized token>;
         token.tokenvalue.numeric_value = <value of numeric token>;
         ...
         return token;
    }

You'll obviously want to put the Token and TreeNode declarations into header files that can be shared across your compiler source files.

If you build a high-performance compiler, you'll want to optimize these routines. One trivial example: the FileHandle can become a global variable, so it need not be passed between the pieces as an explicit argument. One not-so-trivial example: you'll want to use a high-performance lexer generator, or hand-code the lexer, to maximize its character-processing rate, especially at skipping blanks and comments.

If you want to see specific details on how to construct a parser that builds ASTs, see my SO answer on building recursive descent parsers: https://stackoverflow.com/a/2336769/120163

Ira Baxter
  • Thank you for taking the time to explain it more fully. I appreciate it a lot. It's been weeks of going through web pages and the books available. I am excited!!! Repumped now. –  Jun 24 '18 at 20:21
  • If you like my answer, you can upvote it. If you really like it, you can mark it as the accepted answer. – Ira Baxter Jun 24 '18 at 22:05
  • Lol, I thought I did accept the answer. Not enough rep points to upvote. Do you know any good sources to help learn the struct part? –  Jun 25 '18 at 01:24
  • Sounds to me like you haven't had a data structures class. Without that, you're not going to go very far in building a compiler. I suggest you get a book and read it, hard: https://mitpress.mit.edu/books/introduction-algorithms-third-edition – Ira Baxter Jun 25 '18 at 01:54
  • I have a few. Just haven't made it that far into them yet. Thanks man. You have been a great help. –  Jun 25 '18 at 02:20

Yes, the second arrangement you describe is quite typical, with one adjustment.

Yes, the predefined token list (usually an enum, where the language allows it) lives in a header file or similar, and both the lexer and the parser can reference that header file.

But the tokens actually discovered by the lexer are passed on to the parser at execution time.

Frank C.