4

I'm stuck on a problem with writing a parser in Haskell that I hope someone can help out with!

It is a bit more complicated than my usual parser because there are two layers of parsing. First a language definition is parsed into an AST, then that AST is transformed into another parser that parses the actual language.

I have made pretty good progress so far but I'm stuck on implementing recursion in the language definition. As the language definition is transformed from AST into a parser in a recursive function, I can't work out how it can call itself if it doesn't exist yet.

I'm finding it a bit hard to explain my problem, so maybe an example will help.

The language definition might define that a language consists of three keywords in sequence and then optional recursion in brackets.

A B C ($RECURSE)

Which would be parsed into an AST like:

[Keyword "A", Keyword "B", Keyword "C", Optional (Many [Recurse])]

The Many is not really required for this example, but in my actual project, optional blocks can have multiple syntax elements in them so an Optional would contain a Many with n elements.

I would then want it to get transformed into a parser that parses strings like:

A B C
A B C (A B C)
A B C (A B C (A B C))

I've boiled down my project into the simplest possible example. You can see my TODO comment where I'm stuck trying to implement the recursion.

{-# LANGUAGE OverloadedStrings #-}

module Example
  ( runExample,
  )
where

import Control.Applicative hiding (many, some)
import Data.Text (Text)
import Data.Void
import System.IO as SIO
import Text.Megaparsec hiding (State)
import Text.Megaparsec.Char (space1, string')
import qualified Text.Megaparsec.Char.Lexer as L
import Text.Megaparsec.Debug
import Text.Pretty.Simple (pPrint)

-- Types

type Parser = Parsec Void Text

data SyntaxAst = Keyword Text | Recurse | Optional SyntaxAst | Many [SyntaxAst]

--  Megaparsec Base Parsers

-- Space consumer - used by other parsers to ignore whitespace
sc :: Parser ()
sc =
  L.space
    space1
    (L.skipLineComment "--")
    (L.skipBlockComment "/*" "*/")

-- Runs a parser, then consumes any left over space with sc
lexeme :: Parser a -> Parser a
lexeme = L.lexeme sc

-- Parses a string, then consumes any left over space with sc
symbol :: Text -> Parser Text
symbol = L.symbol sc

-- Parses something between parentheses
inParens :: Parser a -> Parser a
inParens =
  between
    (symbol "(")
    (symbol ")")

-- Transforms the AST into a parser
transformSyntaxExprToParser :: SyntaxAst -> Parser [Text]
transformSyntaxExprToParser (Many exprs) = dbg "Many" (createParser exprs)
transformSyntaxExprToParser (Keyword text) = dbg "Keyword" (pure <$> lexeme (string' text))
transformSyntaxExprToParser (Optional inner) = dbg "Optional" (option [] (try (inParens (transformSyntaxExprToParser inner))))
transformSyntaxExprToParser Recurse = dbg "Recurse" (pure ["TODO"]) -- TODO: How do I recurse here?
-- transformSyntaxExprToParser s Recurse = dbg "Recurse" (createParser s) -- Seems to work in the example, but in my actual application creates an infinite loop and freezes

-- Walks over the parser AST and convert it to a parser
createParser :: [SyntaxAst] -> Parser [Text]
createParser expressions =
  do
    foldr1 (liftA2 (<>)) (fmap transformSyntaxExprToParser expressions)

runExample :: IO ()
runExample = do
  -- To make the example simple, lets cut out the language definition parsing and just define
  -- it literally.
  let languageParser = createParser [Keyword "A", Keyword "B", Keyword "C", Optional (Many [Recurse])]
  let run p = runParser p "" "A B C (A B C (A B C))"
  let result = run languageParser
  case result of
    Left bundle -> SIO.putStrLn (errorBundlePretty bundle)
    Right xs -> pPrint xs

A few things I've tried:

  1. Pass the original AST up to the transformSyntaxExprToParser function and call createParser when the Recurse token is encountered. This didn't work due to infinite loops.
  2. Using mutable references like IORef/STRef to pass in a reference that is updated to reference the final parser once the transformation is finished. I couldn't work out how to thread the IO/ST monads into the parser transform function.
  3. State monads. I couldn't work out how to pass a reference through the state monad.

I hope that makes sense, let me know if I need to elaborate more. I can also push up my full project if it will help.

Thanks for reading!

Edit: I've made changes to my original example to demonstrate the infinite loop problem (integrating the excellent suggestions in the answer below) at https://pastebin.com/DN0JJ9BA

Sean Dawson
  • 5,587
  • 2
  • 27
  • 34
  • 2
    That `SyntaxAst` type doesn't look like an abstract syntax *tree*, but rather like a syntax *element* only. The list is your actual AST. – Bergi Jun 10 '21 at 11:26
  • 1
    Can you show us the code of your attempt #1? That approach should work. – Bergi Jun 10 '21 at 11:26
  • @Bergi - You're right, Looks like it works in my contrived example `transformSyntaxExprToParser :: [SyntaxAst] -> SyntaxAst -> Parser [Text] transformSyntaxExprToParser _ (Keyword text) = dbg "Keyword" (pure <$> lexeme (string' text)) transformSyntaxExprToParser s (Optional inner) = dbg "Optional" (option [] (try (inParens (transformSyntaxExprToParser s inner)))) transformSyntaxExprToParser s Recurse = dbg "Recurse" (createParser s)` In my actual application it froze because the parser was stuck in an infinite loop though. – Sean Dawson Jun 10 '21 at 11:34
  • Do you know why that doesn't create an infinite loop? I would have thought it would keep hitting the recurse token and recursing forever. Is that because Haskell is lazy? – Sean Dawson Jun 10 '21 at 11:37
  • I'll have to keep working on it in the morning and I'll update my question with whatever detail it is that causes the infinite loop. – Sean Dawson Jun 10 '21 at 11:40
  • 4
    Yes, it's because of lazy evaluation. However, this approach requires transforming the AST again and again - it would be more efficient to have a `transformSyntaxExprToParser :: Parser [Text] -> SyntaxAst -> Parser [Text]` where `createParser` passes [its own result](https://stackoverflow.com/q/4787421/1048572) into the transform to become the result for a `Recurse` token. – Bergi Jun 10 '21 at 12:17
  • 2
    It sounds like your $RECURSE is actually a weird way of saying what in BNF would be `foo ::= A B C "("foo | {}")"` Would that be right? – Paul Johnson Jun 10 '21 at 15:27
  • @PaulJohnson yes that would be exactly it in BNF. I'm trying to parse a language definition that is part of documentation and it isn't in BNF, otherwise it might be simpler! – Sean Dawson Jun 10 '21 at 22:24

1 Answers1

2

I believe you can use laziness here. Pass the final parser as a parameter to transformSyntaxExprToParser, and when you see a Recurse, return that parser.

transformSyntaxExprToParser :: Parser [Text] -> SyntaxAst -> Parser [Text]
transformSyntaxExprToParser self = go
  where
    go (Keyword text) = dbg "Keyword" (pure <$> lexeme (string' text))
    go (Optional inner) = dbg "Optional" (option [] (try (inParens (go inner))))
    go Recurse = dbg "Recurse" self

createParser :: [SyntaxAst] -> Parser [Text]
createParser expressions = parser
  where
    parser = foldr1 (liftA2 (<>))
      (fmap (transformSyntaxExprToParser parser) expressions)

This ought to produce exactly the same kind of recursive parser as if you had written it directly. A Parser is ultimately just a data structure which you can construct using its instances of Monad, Applicative, Alternative, &c.

Your idea of doing this with a mutable reference such as an IORef is essentially what’s happening under the hood anyway when constructing and evaluating a thunk.

Your idea here was almost correct:

Pass the original AST up to the transformSyntaxExprToParser function and call createParser when the Recurse token is encountered. This didn't work due to infinite loops.

The problem is that you were constructing a new parser for every Recurse, from the same input, which contains a Recurse, thus constructing a new parser…and so on. What my code above does is just pass in the same parser.

If you need to perform monadic side effects while constructing the parser, such as logging, then you can use a recursive do, for example, with some hypothetical MonadLog class for illustration:

{-# Language RecursiveDo #-}

transformSyntaxExprToParser :: (MonadLog m) => Parser [Text] -> SyntaxAst -> m (Parser [Text])
transformSyntaxExprToParser self = go
  where
    go (Keyword text) = do
      logMessage "Got ‘Keyword’"
      pure $ dbg "Keyword" (pure <$> lexeme (string' text))
    go (Optional inner) = do
      logMessage "Got ‘Optional’"
      inner' <- go inner
      pure $ dbg "Optional" (option [] (try (inParens inner')))
    go Recurse = do
      logMessage "Got ‘Recurse’"
      pure $ dbg "Recurse" self

createParser :: (MonadFix m, MonadLog m) => [SyntaxAst] -> m (Parser [Text])
createParser expressions = do
  rec
    parser <- fmap (foldr1 (liftA2 (<>)))
      (traverse (transformSyntaxExprToParser parser) expressions)
  pure parser

The rec block introduces a recursive binding which you may construct using side effects. In general, some care is required to ensure that recursive definitions like this are sufficiently lazy, that is, you don’t force the result sooner than intended, but here the recursion pattern is very simple, and you never examine the self parser, only treat it as a black box to hook up to other parsers.

This method also makes it explicit what the scope of a Recurse is, and opens the possibility of introducing local recursive parsers, with a new call to transformSyntaxExprToParser with a new local self argument.

Jon Purdy
  • 53,300
  • 8
  • 96
  • 166
  • Thanks for the amazing answer! I've reworked the code to match what you describe above and it is much cleaner. I'm still getting some sort of infinite loop though and the program hangs. I'll have to do some more digging to work out if something is forcing strict evaluation or something. – Sean Dawson Jun 10 '21 at 22:22
  • Ah I've worked out how to reproduce the infinite loop I get in my main application. I have another type constructor on `SyntaxAst` called `Many [SyntaxAst]` which defines a sequence of Ast in order. For example if multiple keywords are wrapped in an optional. I'l try and update my question to match. – Sean Dawson Jun 10 '21 at 22:53
  • I've put up my original example code, integrating the changes you suggested above that demonstrates the infinite loop problem at https://pastebin.com/DN0JJ9BA – Sean Dawson Jun 10 '21 at 23:09
  • Ah I see, if I copy the fold up out of `createParser` and put it in `transformSyntaxExprToParser` instead of calling `createParser` recursively it seems to work! https://pastebin.com/vU31tm5z – Sean Dawson Jun 10 '21 at 23:17