Writing Compilers, Lexical Analysis?

Question

I'm completely new to writing compilers. So I am currently starting the project (coded in Java), and before coding, I would like to know more about the lexical analysis part. I have researched on the web, I found out that most of them use tokenizers.

The project requires that I do not use them (tokenizers), and instead use finite state automata. I am confused, do I need to implement it using linked lists? or a simple nested switch cases will do. I am not really that familiar with implementing finite automaton, what are the advantages?

The big compiler resources list is [Learning to write a compiler](http://stackoverflow.com/q/1669/2509). The various theory books/sites/etc therein will explain how to build a lexer with FSA (or you can probably suss out the solutions to straight forward task yourself). There are presumable multiple question on implementing FSA in Java. — dmckee --- ex-moderator kitten, Dec 09 '10 at 17:24
It may be helpful to look at the code generated by some lexer generator. E.g. [COCO/R for Java](http://www.ssw.uni-linz.ac.at/Coco/#Java) produces quite readable code. — max taldykin, Dec 10 '10 at 14:13
@max taldykin, thanks for the pointer to COCO/R, it looks interesting. To the OP, scanner.java in the sources for that _is_ a FSA scanner for the parser format itself, so might provide some inspiration. — The Archetypal Paul, Dec 10 '10 at 14:49

score 1 · Answer 1 · answered Dec 09 '10 at 17:26

I assume this is an assignemnt given the otherwise artificial ban on tokenizers.

There are a lot of google matches for "lexical analysis using finite state automata". What are these missing?

Are you having problems seeing how finite state automata can be used for lexical analysis? Or writing an automaton yourself? It may help to know they're also known as finite state machines(FSM)

A tokenizer may well use a FSM in its internals, so I don't understand why you say you must use an FSM but not a tokenizer - does this mean you can't use a tokenizer you've written and must write one yourself?

The implementation of a regular expression matcher is also normally a finite state machine, so that may be an area to think about.

lex (and its more recent relation flex) are lexical analysers for which the source is available. You could look at those for ideas

actually, lex and flex are not lexical analyzers but lexical analyzer *generators*. so, it can be confusing to look at their source :) — max taldykin, Dec 10 '10 at 14:09
True. I should have been more specific. The output of such tools might be also worth inspection — The Archetypal Paul, Dec 10 '10 at 14:19

score 1 · Accepted Answer · answered Dec 14 '10 at 23:44

Regular Expression Matching Can Be Simple And Fast by Russ Cox is a great introduction to building recognizers with finite state machines. He shows how to go from regular expression to non-deterministic finite automata to deterministic finite automata. His reference implementation uses a directed graph (similar to a linked list but each node can have more than one "out" reference and may reference any other node including itself) versus a linked list. There are other ways to model a state machine including:

You build a lexer/scanner by combining several recognizers. For example, imagine an assignment-only language with the following tokens defined by regular expressions:

identifier : [a-zA-Z]+
assignment : =
number: [0-9]+
keyword : and

As you scan input left-to-right, move through the transitions in each token's machine based on the current symbol. When there are no valid transitions for a symbol, pick the last machine in an accepting state. All symbols scanned until that state are part of the token. Scanning starts again at the symbol after the last accepted symbol. If there are no valid transitions for a new token, the input is invalid and the lexer should report an error. If there is more than one machine in an accepting state, a precedence rule should decide which token is used.

Note: these steps describe a lexer that always returns the longest possible match. You could also design a lexer that returns a token as soon as one of its machines is in an accepting state.

Results on sample inputs:

a=10 : (identifier a)(assignment =)(number 10)
a = 10 : invalid - whitespace isn't accepted in our token definitions
25=xyz : (number 25)(assignment)(identifier xyz)
25and42 : (number 25)(keyword and)(number 42) - assumes keyword takes precedence over identifier
b=1+2 : invalid - '+' isn't accepted in our token definitions
andx=8 : (identifier andx)(assignment)(number 8) - longest match gives us (identifier andx) versus (keyword and)(identifier x)

Notice that lexical analysis just returns tokens. It doesn't know if the tokens are used properly. '25=xyz' may not make any sense but we have to wait until the parsing phase to know for sure.

As an additional resource, Dick Grune offers the first edition of Parsing Techniques - A Practical Guide as Postscript and Pdf.

Writing Compilers, Lexical Analysis?

2 Answers2