
Using fslex I would like to return multiple tokens for one pattern, but I don't see a way to accomplish that. Even using another rule function that returns multiple tokens would work for me.

I am trying to use something like this:

let identifier = [ 'a'-'z' 'A'-'Z' ]+

// ...

rule tokenize = parse
// ...
| '.' identifier '(' { let value = lexeme lexbuf
                       match operations.TryFind(value) with
                       // TODO: here is the problem:
                       // I would like to return something like [DOT; op; LPAREN]
                       | Some op -> op
                       | None    -> ID(value) }

| identifier         { ID (lexeme lexbuf) }
// ...

The problem I am trying to solve here is to match predefined tokens (see the operations map) only if the identifier appears between . and (. Otherwise the match should be returned as an ID.
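For illustration, here is roughly what I mean; the operation name count and its COUNT token are made up for this example:

// hypothetical operations map
let operations =
  [
    "count", COUNT
  ] |> Map.ofList

// desired token streams:
//   "obj.count("  should yield  ID "obj"; DOT; COUNT; LPAREN
//   "count"       should yield  ID "count"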

I am fairly new to fslex so I am happy for any pointers in the right direction.

kongo2002
  • Damn, I always seem to confuse these two ;-) I hope the question makes some sense anyway. – kongo2002 Dec 19 '12 at 13:32
  • This can be solved (though shouldn't be, you should probably redesign your lexer) - if no one else does, I'll post the solution once I get to a comfortable keyboard. – Ramon Snir Dec 19 '12 at 13:44

3 Answers


Okay, here it is.

Each lexer rule (i.e. rule <name> = parse .. cases ..) defines a function <name> : LexBuffer<char> -> 'a, where 'a can be any type. Usually, you return tokens (possibly defined for you by FsYacc), so you can parse text like this:

let parse text =
    let lexbuf = LexBuffer<char>.FromString text
    Parser.start Lexer.tokenize lexbuf

Here Parser.start is the parsing function (from your FsYacc file), of type (LexBuffer<char> -> Token) -> LexBuffer<char> -> AST (Token and AST are your own types, nothing special about them).
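For context, the FsYacc side that gives rise to such a Parser.start would declare something along these lines; the token and start-symbol names here are assumptions for illustration, not taken from the question:

%token DOT LPAREN EOF
%token <string> ID
%start start
%type <AST> start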

In your case, you want <name> : LexBuffer<char> -> 'a list, so all you have to do is this:

let parse' text =
    let lexbuf = LexBuffer<char>.FromString text
    let tokenize =
        let stack = ref []
        fun lexbuf ->
            // refill the stack whenever it runs dry
            while List.isEmpty !stack do
                stack := Lexer.tokenize lexbuf
            // can never get a match failure here,
            // else the while loop wouldn't have exited
            let (token :: stack') = !stack
            stack := stack'
            token
    Parser.start tokenize lexbuf

This simply saves the tokens your lexer supplies and hands them to the parser one by one, generating more tokens as needed.
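For example, the lexer case from the question could then return all three tokens at once. A sketch under that scheme; the Substring arithmetic assumes lexeme returns the full matched text, including the leading '.' and the trailing '(':

| '.' identifier '(' { let value = lexeme lexbuf
                       // strip the leading '.' and the trailing '('
                       let name = value.Substring(1, value.Length - 2)
                       match operations.TryFind(name) with
                       | Some op -> [DOT; op; LPAREN]
                       | None    -> [DOT; ID(name); LPAREN] }
| identifier         { [ID (lexeme lexbuf)] }

Note that every case of the rule now has to return a list, e.g. [ID (lexeme lexbuf)] instead of ID (lexeme lexbuf).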

Ramon Snir
  • Thanks for the insight! I guess this is essentially what I was asking for, although I do have the feeling (as you mentioned above) that I have to modify my lexer logic. But I fear I am stuck at that point. – kongo2002 Dec 19 '12 at 15:45
  • I wrote a lexer or two like that for a while (and the F# compiler also intercepts the tokens and modifies them a bit), but usually it means you're missing something in your general design. By the way, if all you want to solve is this specific scenario, I shall post an alternative. – Ramon Snir Dec 19 '12 at 15:56

Try to keep semantic analysis like "...only if the identifier is between . and (" out of your lexer (fslex), and instead save it for your parser (fsyacc). That is, one option would be to keep your lexer ignorant of operations:

let identifier = [ 'a'-'z' 'A'-'Z' ]+    
// ...
rule tokenize = parse
// ...
| '.' { DOT }
| '(' { LPAREN }
| identifier { ID (lexeme lexbuf) }
// ...

and then in fsyacc solve the problem with a rule like:

| DOT ID LPAREN { match operations.TryFind($2) with
                  | Some op -> Ast.Op(op)
                  | None    -> Ast.Id($2) }
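Note that operations must be in scope inside the generated parser for this action to compile, e.g. by opening the module that defines it in the header of your .fsy file (the module name here is assumed):

%{
open MyLang.Operations // assumed module holding the operations map
%}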

UPDATE in response to comment:

Perhaps the following then in your lexer:

let identifier = [ 'a'-'z' 'A'-'Z' ]+   
let operations =
  [
    "op1", OP1
    "op2", OP2
    //...
  ] |> Map.ofList 

// ...
rule tokenize = parse
// ...
| '.' { DOT }
| '(' { LPAREN }
| identifier 
  { 
    let input = lexeme lexbuf
    match operations |> Map.tryFind input with
    | Some(token) -> token
    | None -> ID(input) 
  }
// ...

and in your parser:

| DOT ID LPAREN { ... }
| DOT OP1 LPAREN { ... }
| DOT OP2 LPAREN { ... }

Thus you have enforced the rule that IDs and operations must come between a DOT and an LPAREN in your parser, while keeping your lexer simple as it should be: it provides a stream of tokens and does little to enforce the validity of the tokens in relation to each other.

Stephen Swensen
  • This is what I thought of too but I wanted to use the tokens following the parsed `ID` in the parser as well. Like `DOT OP1 LPAREN ...` and so on – kongo2002 Dec 19 '12 at 14:39
  • Thanks for the input - your updated version looks pretty much the same as the one I have at the moment. The problem with that approach is that I cannot use the parsed *operation tokens* as `ID`s if they are not between `DOT` and `LPAREN` – kongo2002 Dec 19 '12 at 15:42
  • "pretty much the same" in effect, yes, but more aligned with the design patterns prescribed when using lex and yacc tools. I'll update my answer with a solution to treating `operation`s as `ID`s in a moment. – Stephen Swensen Dec 19 '12 at 17:44
  • Actually, I take that back, I don't have a good solution off the top of my head to offer for the latest rule you are after. But now I think again the best approach would be to go with the first approach I offered in my answer: lex with fslex, parse with fsyacc, and perform semantic analysis (i.e. rules like `DOT OP1 LPAREN ...`) in the custom semantic analysis function you use to process the AST generated by your parser. – Stephen Swensen Dec 19 '12 at 17:56
  • ...try to keep your grammar as simple as possible, it can quickly explode and become difficult to debug. – Stephen Swensen Dec 19 '12 at 17:58
  • Yeah, you're probably right. I simply need to practice some more with fslex/fsyacc to get the concepts straight. Do you know by any chance of some good examples/resources to learn from (apart from the F# compiler source drop)? – kongo2002 Dec 19 '12 at 20:31
  • Humbly, most of my personal experience with fslex and fsyacc has been with my project http://code.google.com/p/nl-compiler/. Studying the F# compiler itself was a great learning experience for me: https://github.com/fsharp/fsharp. But probably the biggest influence on me was http://www.amazon.com/Modern-Compiler-Implementation-Andrew-Appel/dp/0521607647 (not in F# but in the father language ML). – Stephen Swensen Dec 20 '12 at 01:04

(This is a separate answer)

For this specific case, this might solve your issue better:

...

rule tokenize = parse
...
| '.' { DOT }
| '(' { LPAREN }
| identifier { ID (lexeme lexbuf) }

...

And the usage:

let parse'' text =
    let lexbuf = LexBuffer<char>.FromString text
    let stack = ref []
    let rec tokenize lexbuf =
        if List.isEmpty !stack then
            stack := [Lexer.tokenize lexbuf]
        // can never get a match failure here,
        // else the if above would have refilled the stack
        let (token :: stack') = !stack
        stack := stack'
        // this match fixes an ID up to an OP, if necessary;
        // it uses multiple nested matches (and not one unified large one),
        // else EOF may cause issues - this is quite important
        match token with
        | DOT ->
            (match tokenize lexbuf with
             | ID id ->
                 (match tokenize lexbuf with
                  | LPAREN ->
                      let op = findOp id
                      stack := op :: LPAREN :: !stack
                  | t -> stack := ID id :: t :: !stack)
             | t -> stack := t :: !stack)
        | _ -> ()
        token
    Parser.start tokenize lexbuf

This will fix up the IDs to be operations if, and only if, they are surrounded by DOT and LPAREN.
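findOp isn't shown above; a plausible definition, reusing the operations map from the question, would be:

// sketch: map an identifier to its operation token, keep it an ID otherwise
let findOp id =
    match operations.TryFind(id) with
    | Some op -> op
    | None    -> ID(id)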

P.S.: I use 3 separate matches because a unified match would either require Lazy<_> values (making it even less readable) or fail on a sequence of [DOT; EOF], because it would expect an additional third token.
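To make the EOF point concrete: a unified match such as the sketch below evaluates both lookahead reads eagerly (F# tuples are evaluated left to right), so input ending in a lone DOT would force the lexer to keep reading past the end of the input:

// problematic: always reads two tokens after a DOT, even at EOF
| DOT ->
    match tokenize lexbuf, tokenize lexbuf with
    | ID id, LPAREN -> stack := findOp id :: LPAREN :: !stack
    | t1, t2        -> stack := t1 :: t2 :: !stack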

Ramon Snir