31

So lately I have been working on writing a simple compiler to better understand compiler concepts. Being a diligent reader of stackoverfolow, it seems there is a consensus that writing a compiler in a functional language is easier than an imperative one. To this end I thought I would try and kill two birds and write a compiler in F# to both learn a functional language and write a compiler at the same time.

I have been reading through the dragon book and decided to start with a recursive descent parser written by hand in F#. The dragon book, however, has almost all of the code samples in an imperative style. For instance, the match token function does a significant part of its work via side effect.

So my question is what would a more traditional functional approach to parsing (i.e. few side effects) look like? I know that the Haskell compiler (GHC) is written in Haskell but I would appreciate a somewhat smaller and easier to comprehend code sample.

Second, is it worthwhile to try and adopt a functional approach to parsing, or is it really on optimizations to intermediate code that functional languages shine and I just haven't gotten there yet? That is, should I fuddle through the parsing in F# using an imperative style and switch to a more functional approach later on?

Samsdram
  • 1,615
  • 15
  • 18
  • Parsing is easy enough in Haskell, but code generation within that language I find very nice. Especially if you want to play with static analysis. – Orbling Dec 11 '10 at 01:12
  • Possibly see also http://stackoverflow.com/questions/531707/whats-the-best-way-to-write-a-parser-by-hand – Brian Dec 11 '10 at 02:17

5 Answers5

12

Answer derived from this blog article:

So my question is what would a more traditional functional approach to parsing (i.e. few side effects) look like?

Sounds like you need to separate functional (as in Lisp, Scheme, Standard ML, CAML, OCaml, F#) from purity (absence of side effects, as in Haskell) and incidental language features (algebraic datatypes, pattern matching).

Thanks to algebraic datatypes, pattern matching and higher-order functions, F# is a good for parsing and great for transformations and code generation but most production parsers written in F# are not pure. Historically, the family of languages F# is mostly derived from (the MetaLanguages, or MLs) were bred specifically for this kind of metaprogramming.

Here is a very simple set of mutually-recursive active patterns that parse and evaluate mathematical expressions composed of single digits, + - * operators and bracketed subexpressions:

> let rec (|Term|_|) = function
    | Factor(e1, t) ->
        let rec aux e1 = function
          | '+'::Factor(e2, t) -> aux (e1 + e2) t
          | '-'::Factor(e2, t) -> aux (e1 - e2) t
          | t -> Some(e1, t)
        aux e1 t
    | _ -> None
  and (|Factor|_|) = function
    | '-'::Factor(e, t) -> Some(-e, t)
    | Atom(e1, '*'::Factor(e2, t)) -> Some(e1 * e2, t)
    | Atom(e, t) -> Some(e, t)
    | _ -> None
  and (|Atom|_|) = function
    | c::t when '0'<=c && c<='9' -> Some(int(string c), t)
    | '('::Term(e, ')'::t) -> Some(e, t)
    | _ -> None;;
val ( |Term|_| ) : char list -> (int * char list) option
val ( |Factor|_| ) : char list -> (int * char list) option
val ( |Atom|_| ) : char list -> (int * char list) option

Here is an example of it being used to parse and evaluate an expression:

> let (Term e) = List.ofSeq "1+2*(3-4)*-5";;
val e : int * char list = (11, [])

That's a pure solution that's using pattern matching over lists with F#'s active patterns. In reality, you'll want to define a type for your abstract syntax tree and return a value of that type. This is really easy in F#:

type expr =
  | Int of int
  | Neg of expr
  | Add of expr * expr
  | Sub of expr * expr
  | Mul of expr * expr

  static member (~-) f = Neg f
  static member (+) (f, g) = Add(f, g)
  static member (-) (f, g) = Sub(f, g)
  static member (*) (f, g) = Mul(f, g)

let rec (|Term|_|) = function
  | Factor(e1, t) ->
      let rec aux e1 = function
        | '+'::Factor(e2, t) -> aux (e1 + e2) t
        | '-'::Factor(e2, t) -> aux (e1 - e2) t
        | t -> Some(e1, t)
      aux e1 t
  | _ -> None
and (|Factor|_|) = function
  | '-'::Factor(e, t) -> Some(-e, t)
  | Atom(e1, '*'::Factor(e2, t)) -> Some(e1 * e2, t)
  | Atom(e, t) -> Some(e, t)
  | _ -> None
and (|Atom|_|) = function
  | c::t when '0'<=c && c<='9' -> Some(Int(int(string c)), t)
  | '('::Term(e, ')'::t) -> Some(e, t)
  | _ -> None

let (Term e) = List.ofSeq "1+2*(3-4)*-5"

Note that only one minor change to the parser was required because the AST can also be constructed using the +, - and * operators.

Second, is it worthwhile to try and adopt a functional approach to parsing, or is it really on optimizations to intermediate code that functional languages shine and I just haven't gotten there yet?

You're talking about purity, not functional programming. Purity is not particularly useful in the context of parsing text and, in fact, can be a real hindrance (e.g. interning symbols is a nightmare in Haskell). However, F# has many other benefits that make it good for this set of problems. In particular, although other languages like OCaml have much better tools for parsing, I think F# is the best .NET language in this context.

That is, should I fuddle through the parsing in F# using an imperative style and switch to a more functional approach later on?

Depends entirely upon what you want to make functional. I'd use fslex and fsyacc with pure code to construct ASTs in the actions but impurities for anything like hash consing or generating unique IDs.

You may appreciate the following articles I have written on this subject at this blog (note paywall):

  • "Parsing text with Lex and Yacc" (30th September 2007).
  • "Optimizing a simple bytecode interpreter" (31st October 2007).
  • "Parser combinators" (30th November 2007).
  • "Language-oriented programming: The Term-level Interpreter" (31st December 2007).
  • "Language-oriented programming: Term Rewriting" (16th August 2008).
  • "Run-time code generation using System.Reflection.Emit" (31st August 2008).
  • "Parsing and visualizing binary Geographic Information System data" (30th November 2009).
J D
  • 48,105
  • 13
  • 171
  • 274
9

One strategy for functional parsing is monadic parser combinators. You can read some about it here (and follow links) or use a library like FParsec. I do not recommend this approach if you're just learning/starting F#/compilers, though.

Another approach with F# is to use FsLex/FsYacc (in the PowerPack). I kinda loathe Lex/Yacc technology, so I also don't recommend this.

I think you should write a recursive decent parser by hand. I don't have strong feelings regarding a tokenizer, but simply tokeninize the entire file into a(n immutable) list of tokens and then doing recursive descent (and leveraging some pattern-matching) is a good way to to deal with parsing. And of course, you'll want to use discrimated unions to represent the AST output of the parser (a la here).

I haven't read the dragon book in a long time, but I'm apparently the only person on the planet who doesn't like it. I would consider abandoning that text in favor of a book that discusses compilers using some ML-based language, though I can't recommend one offhand.

EDIT

I haven't done one of these in a while, so I took a few minutes to code a small sample.

// AST for tiny language
type Op = 
    | Plus 
    | Minus 
type Expr = 
    | Literal of int 
    | BinaryOp of Expr * Op * Expr // left, op, right 
type Stmt =
    | IfThenElse of Expr * Stmt * Stmt // cond, then, else; 0=false in cond 
    | Print of Expr

// sample program
let input = @"
    if 1+1-1 then 
        print 42 
    else 
        print 0"

// expected AST
let goal = 
    IfThenElse(
        BinaryOp( BinaryOp(Literal(1),Plus,Literal(1)), Minus, Literal(1)), 
        Print(Literal(42)), 
        Print(Literal(0))) 

////////////////////////////////////////////////////////////////////////////
// Lexer

type Token =
    | IF
    | THEN
    | ELSE
    | PRINT
    | NUM of int  // non-negative
    | PLUS
    | MINUS
    | EOF

let makeTokenizer (s:string) =
    let i = ref 0
    let keywords = [
        "if", IF 
        "then", THEN
        "else", ELSE
        "print", PRINT
        "+", PLUS
        "-", MINUS ]
    let rec getNextToken() =
        if !i >= s.Length then
            EOF
        elif System.Char.IsWhiteSpace(s.[!i]) then
            incr i
            getNextToken()
        elif System.Char.IsDigit(s.[!i]) then
            let mutable j = !i
            while j < s.Length && System.Char.IsDigit(s.[j]) do
                j <- j + 1
            let numStr = s.Substring(!i, j - !i)
            i := j
            NUM(System.Int32.Parse(numStr)) // may throw, e.g. if > MAXINT
        else 
            let keyword = keywords |> List.tryPick (fun (kwStr,kwTok) ->
                if s.IndexOf(kwStr, !i) = !i then
                    i := !i + kwStr.Length
                    Some(kwTok)
                else
                    None)
            match keyword with
            | Some k -> k
            | None -> 
                failwith "unexpected char '%c' at position %d" s.[!i] !i
    getNextToken

let tokens = 
    let nextToken = makeTokenizer input
    let t = ref(nextToken())
    [ 
        yield !t
        while !t <> EOF do
            t := nextToken()
            yield !t
    ]

printfn "%A" tokens // sanity check our tokenizer works

/////////////////////////////////////////////////////////////////////////
// Parser

let parseExpr toks =
    match toks with
    | NUM x :: rest ->
        let mutable rest = rest
        let mutable expr = Literal x
        while rest.Head = PLUS || rest.Head = MINUS do
            let op,y,r = 
                match rest with
                | PLUS::NUM y::t -> Plus, Literal y, t
                | MINUS::NUM y::t -> Minus, Literal y, t
                | _ -> 
                    failwith "parse error in expression, expected number"
            expr <- BinaryOp(expr, op, y)
            rest <- r
        expr, rest
    | _ -> failwith "parse error in expression, expected number"
let rec parseStmt toks =
    match toks with
    | PRINT :: rest -> 
        let e,rest = parseExpr(rest)
        Print(e), rest
    | IF :: rest ->
        let e,rest = parseExpr(rest)
        match rest with
        | THEN :: rest ->
            let s1,rest = parseStmt(rest)
            match rest with
            | ELSE :: rest ->
                let s2,rest = parseStmt(rest)
                IfThenElse(e,s1,s2), rest
            | _ -> 
                failwith "parse error after if branch, espected 'else'"
        | _ -> 
            failwith "parse error after if expression, expected 'then'"
    | _ -> failwith "parse error, expected statement"
let parseProgram toks =
    let s,rest = parseStmt toks
    match rest with
    | [EOF] -> s
    | _ -> failwith "parse error after statement, expected EOF"

let p = parseProgram tokens
printfn "%A" p
assert( p = goal )

(Hopefully there are no egregious bugs.)

Brian
  • 117,631
  • 17
  • 236
  • 300
  • I hadn't seen representing an AST with a discriminated union before. That is a pretty cool feature. – Samsdram Dec 11 '10 at 01:20
  • 9
    @Samsdram: Actually, when people are talking about how easy it is to write compilers in functional languages *that* is *exactly* what they are talking about. It's actually not so much about being a functional language, but rather a language with a sophisticated type system and powerful pattern matching. – Jörg W Mittag Dec 11 '10 at 01:25
  • 2
    Indeed, DUs and pattern-matching are just made for tree-processing, and compilers are all about tree-processing. – Brian Dec 11 '10 at 02:09
  • @Jörg W Mittag: I would also mention the [Tail call](http://en.wikipedia.org/wiki/Tail_call) which in our case ensures no stack overflow will happen during parsing as opposed to using recursion in traditional imperative languages where return address is put on top of the stack in each call. – YasirA Dec 11 '10 at 02:12
  • @Brain: why do you loath Lex/Yacc technology? how does the F# compiler go about lexing and parsing? – Stephen Swensen Dec 11 '10 at 17:31
  • 1
    The F# compiler uses FsLex and FsYacc, which is maybe part of why I don't like it. :) Yacc parsers are hard to author and understand and debug/diagnose IMO. And I also don't like having to use a DSL (an EDSL would be ok) because I lose so much tooling. And these are like 30- or 40- year-old technologies, so I have to imagine there is something better now. (Last time I used ANTLR was about 10 years ago, I think maybe it addresses one or two of my criticisms, but not all.) As an aside, I find parser combinators very hard to debug/diagnose as well. – Brian Dec 11 '10 at 17:39
  • @Brain: interesting. So you prefer hand-written recursive decent parsers? Is that technique suitable for languages as advanced as F#? I wonder how much room there is for improvement in this area, given parsing has got to be one of the oldest topics in computer science (though perhaps that is false thinking since computer science is quite young). Maybe it's just an inherently difficult topic, with no golden solution (i.e. "easy" to author and debug/diagnose yet powerful too). – Stephen Swensen Dec 11 '10 at 18:19
  • I prefer hand-written or LL parsers for debugging/diagnosis. So for example, the code ANTLR generates is debuggable, as I recall, since it uses LL technology. Yacc uses LR, which is in some sense more powerful, but it adds an abstraction layer that makes it very hard to tell what is going on when you try to debug it. I think part of the problem is that CS treated parsing as a 'solved problem' (which it is, in the academic sense), but real-world software requires more. (TBC...) – Brian Dec 11 '10 at 20:10
  • 1
    ...A command-line compiler only requires you to be able to parse correct programs (or else report an error). But an IDE requires that you can parse incorrect programs, and still recover and provide useful parse info so Intellisense, Parameter Info, etc, all continue to function even when the programmer is in the middle of editing the file and has un-finished code that does not parse. Which means you need sophisticated error recovery logic and lots of extra rules in the grammar to deal with common mal-formed fragments of code. And Yacc/LR tools are typically too opaque when parses fail. – Brian Dec 11 '10 at 20:15
  • Very nice example but for any complex language the implementation will balloon out of control. – ChaosPandion Dec 11 '10 at 22:46
  • @Brian: I find ocamllex and ocamlyacc easy to use including debugging. The performance is great and they have IDE support. I think it would be great if F# was up-to-par with them (e.g. named submatches in the lexer and support in the IDE) and you could even go further by making them EDSLs using run-time code generation. CamlP4 is an EDSL, of course. As for preserving info during edits, I think it is better to cache previous results rather than trying to dick with the parser. – J D Dec 12 '10 at 15:48
7

Parser combinators are indeed beautiful! FParsec is a very slick monadic parser combinator library you should check out. If you want to start out with something simple and still purely functional, you might enjoy the tokenizer/parser from the Scheme interpreter in F# series here (my blog): http://blogs.msdn.com/b/ashleyf/archive/2010/09/24/fscheme-0-0-0.aspx

Danilo Piazzalunga
  • 7,590
  • 5
  • 49
  • 75
AshleyF
  • 3,682
  • 1
  • 25
  • 23
7

A simpler answer than the other good answers:

A parser in a function language takes a token stream into a parse tree and the rest of the token stream. That is, it has type

 token list -> ast * token list

A recursive decent parser usually have a large number of functions that of this type which eats the token stream and then builds a little part of the parse tree. By calling these recursively (recursive decent) -- you get what you want.

The next step up is to use higher order parsers: parsers operating on other parsers. This is what parser combinator libraries do. Perhaps you could start with a simple recursion scheme and then upgrade it to parser combinators.

I GIVE CRAP ANSWERS
  • 18,739
  • 3
  • 42
  • 47
4

I've been working on a ECMAScript compiler in F# for a while so I'm in the same boat as you. Mayhap some of my work could be of use to you. Here is a simple parser combinator library I've been working on which I use in combination with FParsec. It is no where near perfect but it should give you something simple enough to study so you can move to more advanced things. If you do end up using FParsec you may notice that a lot of things here were inspired by it.

module Tools =

    open System
    open System.Diagnostics
    open LazyList 

    [<Struct;DebuggerStepThrough>]
    type State<'a, 'b> (input:LazyList<'a>, data:'b) = //'
        member this.Input = input
        member this.Data = data

    type Result<'a, 'b, 'c> = //'
    | Success of 'c * State<'a, 'b>
    | Failure of list<string> * State<'a, 'b>    

    type Parser<'a, 'b, 'c> = //' 
        State<'a, 'b> -> seq<Result<'a, 'b, 'c>>

    let zero<'a, 'b, 'c> (state:State<'a, 'b>) = //'
        Seq.empty<Result<'a, 'b, 'c>>

    let item<'a, 'b> (state:State<'a, 'b>) = seq { //'
        match state.Input with
        | Cons (head, tail) ->
            yield Success(head, State (tail, state.Data))
        | Nil -> ()
    } 

    let result<'a, 'b, 'c> (value:'c) (state:State<'a, 'b>) = seq  { //'
        yield Success (value, state)
    }

    let run p i d =
        p (State(i, d)) 

    let (>>=) (m:Parser<'a, 'b, 'c>) (f:'c -> Parser<'a, 'b, 'd>) (state:State<'a, 'b>) = //'
        let rec run errors = seq {
            for r in m state do
                match r with
                | Success (v, s) ->
                    yield! f v s
                | Failure (ms, s) ->
                    yield! run (errors @ ms)
        }
        run []

    let (<|>) (l:Parser<'a, 'b, 'c>) (r:Parser<'a, 'b, 'c>) (state:State<'a, 'b>) = //'  
        let rec run p = seq {
            for result in p state do
                match result with
                | Success (_, _) ->
                    yield result
                | Failure (_, _) -> ()
        }
        Seq.append (run l) (run r)

    type ParseMonad() =        
        member this.Bind (f:Parser<'a, 'b, 'c>, g:'c -> Parser<'a, 'b, 'd>) : Parser<'a, 'b, 'd> = f >>= g //'     
        member this.Combine (f, g) = f <|> g      
        member this.Delay (f:unit -> Parser<'a, 'b, 'c>) (state:State<'a, 'b>) = f () state //'
        member this.Return x = result x
        member this.ReturnFrom p = p
        member this.Zero () = zero

    let parse = ParseMonad()

    let (|>>) (parser:Parser<'a, 'b, 'c>) (f:'c -> 'd) = parse { //'
        let! v = parser
        return f v   
    }

    let satisfy predicate = parse {
        let! value = item
        if predicate value then
            return value 
    }

    let maybe parser = parse {
        return! parser |>> Some <|> result None 
    }

    let choice (ps:seq<Parser<'a, 'b, 'c>>) (state:State<'a, 'b>) = seq { //'
        if not (LazyList.isEmpty state.Input) then
            for p in ps do
                yield! p state    
    }

    let between left right parser =
        parse {
            let! _ = left
            let! v = parser
            let! _ = right
            return v
        }

    let skip p = parse {
        let! v = p
        return ()
    }

    let many parser = 
        let rec many result = parse {
            let! v = parser
            let result = v::result
            return! many result
            return result    
        }
        many []

    let many1 parser = parse {
        let! r = many parser
        if not r.IsEmpty then
            return r
    }

    let manyFold parser start (f:_ -> _ -> _) = parse {
        let! r = many parser
        return r |> List.fold f start
    }

    let many1Fold parser start (f:_ -> _ -> _) = parse {
        let! r = many1 parser
        return r |> List.fold f start
    } 

    let isNotFollowedBy p =
        parse {
            let! v = maybe p
            match v with
            | Some _ -> ()
            | None -> return ()
        }

    let pipe2 (p1:Parser<'a, 'b, 'c>) (p2:Parser<'a, 'b, 'd>) (f:'c -> 'd -> 'e) = //' 
        parse {
            let! v1 = p1
            let! v2 = p2
            return f v1 v2
        }

    let pipe3 (p1:Parser<'a, 'b, 'c>) (p2:Parser<'a, 'b, 'd>) (p3:Parser<'a, 'b, 'e>) (f:'c -> 'd -> 'e -> 'f) = //' 
        parse {
            let! v1 = p1
            let! v2 = p2
            let! v3 = p3
            return f v1 v2 v3
        }

    let pipe4 (p1:Parser<'a, 'b, 'c>) (p2:Parser<'a, 'b, 'd>) (p3:Parser<'a, 'b, 'e>) (p4:Parser<'a, 'b, 'f>) (f:'c -> 'd -> 'e -> 'f -> 'g) = //' 
        parse {
            let! v1 = p1
            let! v2 = p2
            let! v3 = p3
            let! v4 = p4
            return f v1 v2 v3 v4
        }

    let pipe5 (p1:Parser<'a, 'b, 'c>) (p2:Parser<'a, 'b, 'd>) (p3:Parser<'a, 'b, 'e>) (p4:Parser<'a, 'b, 'f>) (p5:Parser<'a, 'b, 'g>) (f:'c -> 'd -> 'e -> 'f -> 'g -> 'h) = //' 
        parse {
            let! v1 = p1
            let! v2 = p2
            let! v3 = p3
            let! v4 = p4
            let! v5 = p5
            return f v1 v2 v3 v4 v5
        }

    let tuple2<'a, 'b, 'c, 'd, 'e> (p1:Parser<'a, 'b, 'c>) (p2:Parser<'a, 'b, 'd>) (f:'c * 'd -> 'e) = //' 
        parse {
            let! v1 = p1
            let! v2 = p2
            return f (v1, v2)
        }

    let tuple3 (p1:Parser<'a, 'b, 'c>) (p2:Parser<'a, 'b, 'd>) (p3:Parser<'a, 'b, 'e>) (f:'c * 'd * 'e -> 'f) = //' 
        parse {
            let! v1 = p1
            let! v2 = p2
            let! v3 = p3
            return f (v1, v2, v3)
        }

    let tuple4 (p1:Parser<'a, 'b, 'c>) (p2:Parser<'a, 'b, 'd>) (p3:Parser<'a, 'b, 'e>) (p4:Parser<'a, 'b, 'f>) (f:'c * 'd * 'e * 'f -> 'g) = //' 
        parse {
            let! v1 = p1
            let! v2 = p2
            let! v3 = p3
            let! v4 = p4
            return f (v1, v2, v3, v4)
        }

    let tuple5 (p1:Parser<'a, 'b, 'c>) (p2:Parser<'a, 'b, 'd>) (p3:Parser<'a, 'b, 'e>) (p4:Parser<'a, 'b, 'f>) (p5:Parser<'a, 'b, 'g>) (f:'c * 'd * 'e * 'f * 'g -> 'h) = //' 
        parse {
            let! v1 = p1
            let! v2 = p2
            let! v3 = p3
            let! v4 = p4
            let! v5 = p5
            return f (v1, v2, v3, v4, v5)
        }

    let createParserRef<'a, 'b, 'c> () = //'
        let dummyParser = fun state -> failwith "a parser was not initialized"
        let r = ref dummyParser
        (fun state -> !r state), r : Parser<'a, 'b, 'c> * Parser<'a, 'b, 'c> ref //'

NOTE: You will need FSharp PowerPack for the LazyList type.

Example:

and conditionalExpressionNoIn = 
    parse {
        let! e1 = logicalORExpressionNoIn
        return! parse {
            do! skip expectQuestionMark
            let! e2 = assignmentExpression
            do! skip expectColon
            let! e3 = assignmentExpressionNoIn
            return ConditionalExpressionNoIn (e1, e2, e3)
        }
        return ConditionalExpressionNoIn (e1, SourceElement.Nil, SourceElement.Nil)
    }
ChaosPandion
  • 77,506
  • 18
  • 119
  • 157