Using the Earley library to parse with features and unification

Question

The Earley parsing library is great for writing linguistic parsers in Haskell. CFGs can be specified in an intuitive way, and there is excellent support for backtracking and ambiguity. A simple example:

{-# LANGUAGE OverloadedStrings #-}

import Text.Earley

np = rule ("John" <|> "Mary")
vp = rule ("runs" <|> "walks")

sentence = do
  subj <- np
  pred <- vp
  return $ (++) <$> subj <*> pred

sentence can be used to parse ["John", "runs"] or ["Mary", "walks"], among other inputs.

It would be nice to be able to use Earley to write parsers for FCFGs, where nonterminals are complexes of a label and a feature bundle, and feature matching can happen via unification (for example, the Earley parser in NLTK parses FCFGs). However, it is not clear how to do this using Earley, or whether it can even be done. An example of something we might want in something like BNF:

np[sg] ::= "John" | "Mary"

np[?x] ::= det n[?x]
n[pl]  ::= "boys" | "girls"

det    ::= "the"

vp[sg] ::= "runs" | "walks"
vp[pl] ::= "run"  | "walk"

s ::= np[?x] vp[?x]

Under this FCFG, ["John", "runs"] is an s (since their number features match, as required by the s rule), and ["the", "boys", "walks"] isn't an s (since ["the", "boys"] parses to np[pl] and ["walks"] parses to vp[sg]).

One can in general rewrite an FCFG into an equivalent CFG, but this can be highly inconvenient, and result in a blowup of the grammar, especially when we have many possible features ranging over many possible values.

Daniel Wagner · Accepted Answer · 2023-02-20T17:25:02.173

You're not actually doing any particularly interesting unification here, so perhaps it's enough to toss a very simple nondeterminism applicative of your own into the mix. The standard one is [], but for this case, even Maybe looks like enough. Like this:

{-# Language OverloadedStrings #-}
{-# Language TypeApplications #-}

import Control.Applicative
import Control.Monad
import Data.Foldable
import Text.Earley

data Feature = SG | PL deriving (Eq, Ord, Read, Show)

(=:=) :: (Feature, a) -> (Feature, b) -> Maybe (a, b)
(fa, a) =:= (fb, b) = (a, b) <$ guard (fa == fb)

data NP = Name String | Determined String String deriving (Eq, Ord, Read, Show)

np :: Grammar r (Prod r e String (Feature, NP))
np = rule . asum $
    [ fmap (\name -> (SG, Name name)) ("John" <|> "Mary")
    , liftA2 (\det n -> (PL, Determined det n)) "the" ("boys" <|> "girls")
    ]

vp :: Grammar r (Prod r e String (Feature, String))
vp = rule . asum $
    [ (,) SG <$> ("runs" <|> "walks")
    , (,) PL <$> ("run" <|> "walk")
    ]

s :: Grammar r (Prod r e String (Maybe (NP, String)))
s = liftA2 (liftA2 (=:=)) np vp

test :: [String] -> IO ()
test = print . allParses @() (parser s)

Try it out in ghci:

> sequence_ [test (words n ++ [v]) | n <- ["John", "the boys"], v <- ["walks", "walk"]]
([(Just (Name "John","walks"),2)],Report {position = 2, expected = [], unconsumed = []})
([(Nothing,2)],Report {position = 2, expected = [], unconsumed = []})
([(Nothing,3)],Report {position = 3, expected = [], unconsumed = []})
([(Just (Determined "the" "boys","walk"),3)],Report {position = 3, expected = [], unconsumed = []})

So, the result needs a bit of interpretation -- a successful parse of Nothing really counts as a failed parse -- but perhaps that's not so bad? Not sure. Certainly it's unfortunate that you don't get to reuse Earley's error-reporting and nondeterminism machinery. Probably to get either thing, you'd have to fork Earley.

If you need to do real unification you could look into returning a IntBindingT t Identity instead of a Maybe, but at least until your features are themselves recursive this is probably enough and much, much simpler.

A couple points/questions. First, for efficiency reasons, I prefer to have failed parses in lieu of successful parses to `Nothing` (I expect your approach returns exponentially many `Nothing` parses). Also, the example was a toy/minimal one. In general, the feature structures can be much more complex, and their possible values much more numerous. It would be prohibitive to list separate entries for every word like `"the"` which simply passes up the features on its neighbor. I imagine that can be dealt with via a richer `Feature` type and a more elaborate `(=:=)`? — SEC, Feb 20 '23 at 14:42
@SEC The approach should only return exponentially many `Nothing` parses if the unannotated grammar would. I suggest a package in the answer already for supplying a more elaborate `(=:=)`. — Daniel Wagner, Feb 20 '23 at 15:07
Yes, but a more restrictive grammar shouldn't be doing as much work as a less restrictive counterpart (by analogy, a regular CFG shouldn't generate exponentially many failed parses for an unambiguous string, even though there are that many bracketings). I think what would help is being able to parameterize rules like `np` and `vp` to featural information generated by a prior parsing rule, as is done in standard implementations of FCFGs like definite clause grammars, but that seems impossible using the `Earley` API, since `Prod r e t` and `Grammar r . Prod r e t` aren't monads. — SEC, Feb 20 '23 at 15:43
@SEC I see that something in my head was not clear in the answer. I will clarify that shortly in the answer. But the delta is this: if you want to meld Earley's nondeterminism with some other nondeterministic mechanism, it looks to me like you will need to fork Earley. — Daniel Wagner, Feb 20 '23 at 17:24

Using the Earley library to parse with features and unification

1 Answers1