I am creating a tokeniser in ML-Lex. Part of its definition is:

datatype lexresult = STRING
                     | STRINGOP
                     | EOF
val error = fn x => TextIO.output(TextIO.stdOut,x ^ "\n")
val eof = fn () => EOF

%%
%structure myLang
digit=[0-9];
ws=[\ \t\n];
str=\"[.*]+\";
strop=\[[0-9...?\^]\];
%s alpha;
alpha=[a-zA-Z];
%%

<alpha> {alphanum}+ => (ID);
. => (error ("myLang: ignoring bad character " ^ yytext); lex());

I want the token ID to be detected only when it starts with, or is found after, "alpha". I know that writing it as

{alpha}+ {alphanum}* => (ID);

is an option, but I also need to learn how to use start states for some other purposes. Can someone please help me with this?

Chandan

1 Answer

The information you need is in the ML-Lex documentation that comes with SML, which is available in various places. Many university courses also have online notes that contain working examples.

The first thing to note from your example code is that you have overloaded the name alpha, using it to name both a state and a pattern. This is probably not a good idea. In addition, the pattern alphanum is not defined, and the result ID is not declared. These are basic errors which you should fix before thinking about using states, or before posting a question here on SO. Asking for help with code that has such obvious faults in it does not encourage help from the experts. :-)

Having fixed up those errors, we can start using states. Here is my version of your code:

datatype lexresult = ID
                     | EOF
val error = fn x => TextIO.output(TextIO.stdOut,x ^ "\n")
val eof = fn () => EOF

%%
%structure myLang
digit=[0-9];
ws=[\ \t\n];
str=\"[.*]+\";
strop=\[[0-9...?\^]\];
%s ALPHA_STATE;
alpha=[a-zA-Z];
alphanum=[a-zA-Z0-9];
%%

<INITIAL>{alpha} => (YYBEGIN ALPHA_STATE; continue());
<ALPHA_STATE>{alphanum}+ => (YYBEGIN INITIAL; TextIO.output(TextIO.stdOut,"ID\n"); ID);

. => (error ("myLang: ignoring bad character " ^ yytext); lex());

You can see I've added ID to the lexresult, named the state ALPHA_STATE and added the alphanum pattern. Now let's look at how the state code works:

There are two states in this program, called INITIAL and ALPHA_STATE (all lex programs have an INITIAL default state). The lexer always begins recognising in the INITIAL state. The rule <INITIAL>{alpha} => says that if a letter is encountered while in the INITIAL state (i.e. NOT in ALPHA_STATE), it is a match and the action should be invoked. The action for this rule works as follows:

YYBEGIN ALPHA_STATE;          (* Switch from INITIAL state to ALPHA_STATE *)
continue()                    (* and keep going *)

Now that we are in ALPHA_STATE, the rules defined for this state are enabled, which includes the rule <ALPHA_STATE>{alphanum}+ =>. The action on this rule switches back to the INITIAL state and records the match.
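
To make the state switching concrete, here is a minimal hand-written sketch in plain SML that mimics the two rules above without running ML-Lex. It is not what ML-Lex generates: the names state, token, span and scan are made up for this illustration, and it simplifies by treating every unmatched character (including a newline) as a "bad character" instead of printing a message.

```sml
(* Hand-written simulation of the two start states; NOT ML-Lex output.
   All names here (state, token, span, scan) are hypothetical. *)
datatype state = INITIAL | ALPHA_STATE
datatype token = ID

(* Longest prefix of the list whose characters satisfy p, plus the rest. *)
fun span p [] = ([], [])
  | span p (c :: cs) =
      if p c
      then let val (ys, zs) = span p cs in (c :: ys, zs) end
      else ([], c :: cs)

(* scan returns the tokens found and the characters rejected as bad. *)
fun scan (s : string) : token list * char list =
  let
    fun go (_, [], toks, bad) = (rev toks, rev bad)
      | go (INITIAL, c :: cs, toks, bad) =
          if Char.isAlpha c
          then go (ALPHA_STATE, cs, toks, bad)   (* YYBEGIN ALPHA_STATE; continue() *)
          else go (INITIAL, cs, toks, c :: bad)  (* catch-all rule, state unchanged *)
      | go (ALPHA_STATE, c :: cs, toks, bad) =
          if Char.isAlphaNum c
          then
            let
              (* {alphanum}+ : consume the longest alphanumeric run *)
              val (_, rest) = span Char.isAlphaNum (c :: cs)
            in
              go (INITIAL, rest, ID :: toks, bad) (* YYBEGIN INITIAL; return ID *)
            end
          else go (ALPHA_STATE, cs, toks, c :: bad) (* catch-all rule, state unchanged *)
  in
    go (INITIAL, String.explode s, [], [])
  end
```

For example, scan "hello" evaluates to ([ID], []): the h is consumed by the INITIAL rule and ello is matched in ALPHA_STATE. Note that, in this sketch as in the rules above, the first letter is consumed by the INITIAL rule, so an input consisting of a single letter produces no ID.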

For a longer example of using states (lex rather than ML-lex) you can see my answer here: Error while parsing comments in lex.

To test this ML-Lex program I referred to this helpful question: building a lexical analyser using ml-lex, and wrote the following SML program:

use "states.lex.sml";
open myLang
val lexer =
let 
  fun input f =
      case TextIO.inputLine f of
        SOME s => s
      | NONE => raise Fail "Implement proper error handling."
in 
  myLang.makeLexer (fn (n:int) => input TextIO.stdIn)
end
val nextToken = lexer();

Just for completeness, running it produced the following output, demonstrating the match:

c:\Users\Brian>"%SMLNJ_HOME%\bin\sml" main.sml
Standard ML of New Jersey v110.78 [built: Sun Dec 21 15:52:08 2014]
[opening main.sml]
[opening states.lex.sml]
[autoloading]
[library $SMLNJ-BASIS/basis.cm is stable]
[autoloading done]
structure myLang :
  sig
    structure UserDeclarations : <sig>
    exception LexError
    structure Internal : <sig>
    val makeLexer : (int -> string) -> unit -> Internal.result
  end
val it = () : unit
hello
ID
Brian Tompsett - 汤莱恩