Regular Expression Matching Can Be Simple And Fast by Russ Cox is a great introduction to building recognizers with finite state machines. He shows how to go from regular expression to non-deterministic finite automata to deterministic finite automata. His reference implementation uses a directed graph (similar to a linked list but each node can have more than one "out" reference and may reference any other node including itself) versus a linked list. There are other ways to model a state machine including:
You build a lexer/scanner by combining several recognizers. For example, imagine an assignment-only language with the following tokens defined by regular expressions:
- identifier : [a-zA-Z]+
- assignment : =
- number: [0-9]+
- keyword : and
As you scan input left-to-right, move through the transitions in each token's machine based on the current symbol. When there are no valid transitions for a symbol, pick the last machine in an accepting state. All symbols scanned until that state are part of the token. Scanning starts again at the symbol after the last accepted symbol. If there are no valid transitions for a new token, the input is invalid and the lexer should report an error. If there is more than one machine in an accepting state, a precedence rule should decide which token is used.
Note: these steps describe a lexer that always returns the longest possible match. You could also design a lexer that returns a token as soon as one of its machines is in an accepting state.
Results on sample inputs:
- a=10 : (identifier a)(assignment =)(number 10)
- a = 10 : invalid - whitespace isn't accepted in our token definitions
- 25=xyz : (number 25)(assignment)(identifier xyz)
- 25and42 : (number 25)(keyword and)(number 42) - assumes keyword takes precedence over identifier
- b=1+2 : invalid - '+' isn't accepted in our token definitions
- andx=8 : (identifier andx)(assignment)(number 8) - longest match gives us (identifier andx) versus (keyword and)(identifier x)
Notice that lexical analysis just returns tokens. It doesn't know if the tokens are used properly. '25=xyz' may not make any sense but we have to wait until the parsing phase to know for sure.
As an additional resource, Dick Grune offers the first edition of Parsing Techniques - A Practical Guide as Postscript and Pdf.