
I wrote a PEG parser generator just for fun (I will publish it on NPM sometime), and thought it would be easy to add a randomised phrase generator on top of it. The idea is to automatically get valid phrases, given a grammar. So I set the following rules to generate strings from each type of parser (sketched in code after the list):

  • Sequence p1 p2 ... pn: Generate a phrase for each subparser and return the concatenation.
  • Alternative p1 | p2 | ... | pn: Randomly pick a subparser and generate a phrase with it.
  • Repetition p{n, m}: Pick a number x in [n, m] (or [n, n+2] if m === Infinity) and return the concatenation of x generated phrases from p.
  • Terminal: Just return the terminal literal.
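
Sketched in Python for brevity (my real generator runs on NodeJS; the dict-based node shape below is made up just for illustration):

import random

def generate(node):
  kind = node['type']
  if kind == 'sequence':
    return ' '.join(generate(child) for child in node['children'])
  if kind == 'alternative':
    return generate(random.choice(node['children']))
  if kind == 'repetition':
    n, m = node['min'], node['max']
    if m == float('inf'):
      m = n + 2                        # the [n, n+2] cap for open-ended bounds
    return ' '.join(generate(node['child']) for _ in range(random.randint(n, m)))
  if kind == 'terminal':
    return node['literal']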

When I take the following grammar:

S: NP VP
PP: P NP
NP: Det N | Det N PP | 'I'
VP: V NP | VP PP
V: 'shot' | 'killed' | 'wounded'
Det: 'an' | 'my' 
N: 'elephant' | 'pajamas' | 'cat' | 'dog'
P: 'in' | 'outside'

It works great. Some examples:

my pajamas killed my elephant
an pajamas wounded my pajamas in my pajamas
an dog in I wounded my cat in I outside my elephant in my elephant in an pajamas outside an cat
I wounded my pajamas in my dog

This grammar has a recursion (PP: P NP > NP: Det N PP). When I take this other recursive grammar, this time for math expressions:

expr: term (('+' | '-') term)*
term: fact (('*' | '/') fact)*
fact: '1' | '(' expr ')'

About half the time, I get a "Maximum call stack size exceeded" error (in NodeJS). The other half of the time, I get correct expressions:

( 1 ) * 1 + 1
( ( 1 ) / ( 1 + 1 ) - ( 1 / ( 1 * 1 ) ) / ( 1 / 1 - 1 ) ) * 1
( ( ( 1 ) ) )
1
1 / 1

I guess the recursive production for fact gets picked too often, the recursion goes too deep, and the whole thing blows up.
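
A back-of-the-envelope check seems to confirm this: each fact expands to '(' expr ')' with probability 1/2; each expr contains on average 2 terms (1 plus a repetition count drawn from [0, 2]); each term likewise contains on average 2 facts. So each fact spawns on average 1/2 × 2 × 2 = 2 new facts, which makes the derivation a supercritical branching process: with a fixed positive probability it never terminates, which would explain the failure rate I observe.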

How can I make my approach less naive in order to avoid those cases that explode the call stack? Thank you.

ostrebler
  • Note that with PEG, "Alternative ... : Randomly pick a subparser and generate a phrase with it" will not necessarily produce a valid input. That's because PEG alternatives are ordered, and an early alternative can shadow a subset of the prefixes of later alternatives. In other words, you could randomly select a phrase generated by a subparser which starts with something which could be generated by an earlier subparser and thus could never be recognised by the parser. I think there is a related issue with repetitions. – rici Jan 24 '20 at 02:08
  • @rici True, is there a way to counter that? – ostrebler Jan 24 '20 at 02:19
  • The simple way would be to verify that the generated sentence can be parsed before returning it. Depending on your grammar, that might or might not require a lot of retries; I'm not even going to venture a guess. Other than that, I don't know; it seems to me that just about all interesting theoretical questions about PEGs turn out to be intractable and I have never tried very hard to overcome that. – rici Jan 24 '20 at 02:23
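
A minimal sketch of this generate-and-check workaround, where parse is a hypothetical predicate over the PEG parser and generate is the naive generator sketched in the question:

def generate_valid(start_node, parse, max_tries=100):
  # Ordered choice can shadow generated phrases: with A <- 'a' / 'ab', the
  # generator may emit "ab", but the parser commits to the first alternative
  # 'a' and then chokes on the leftover 'b'. Retry until the parser accepts.
  for _ in range(max_tries):
    phrase = generate(start_node)
    if parse(phrase):
      return phrase
  raise RuntimeError('no parsable phrase after %d tries' % max_tries)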

1 Answer


Of course, if a grammar describes arbitrarily long inputs, you can easily end up in very deep recursion. A simple way to avoid this trap is to keep a priority queue of partially expanded sentential forms, where the key is length. Remove the shortest, replace each non-terminal in each possible way, emit the forms that are now all terminals, and add the rest back onto the queue. You might also want to maintain an "already emitted" set to avoid emitting duplicates. If the grammar doesn't have anything like epsilon productions, where a sentential form derives a shorter string, then this method produces all strings described by the grammar in non-decreasing length order. That is, once you've seen an output of length N, all strings of length N-1 and shorter have already appeared.

Since OP asked about details, here's an implementation for the expression grammar. It's simplified by rewriting the PEG as a CFG.

import heapq

def run():
  g = {
    '<expr>': [
      ['<term>'],
      ['<term>', '+', '<expr>'],
      ['<term>', '-', '<expr>'],
    ],
    '<term>': [
      ['<fact>'],
      ['<fact>', '*', '<term>'],
      ['<fact>', '/', '<term>'],
    ],
    '<fact>': [
      ['1'],
      ['(', '<expr>', ')']
    ],
  }
  gen(g)

def is_terminal(s):
  # A sentential form is fully expanded when it contains no '<...>' symbol.
  for sym in s:
    if sym.startswith('<'):
      return False
  return True

def gen(g, lim=10000):
  # Priority queue of sentential forms, keyed by length (shortest first).
  q = [(1, ['<expr>'])]
  n = 0
  while n < lim:
    _, s = heapq.heappop(q)
    # print("pop: " + ''.join(s))
    # Split s as a + [sym] + b, where sym is the leftmost non-terminal.
    a = []
    b = s.copy()
    while b:
      sym = b.pop(0)
      if sym.startswith('<'):
        for rhs in g[sym]:
          s_new = a.copy()
          s_new.extend(rhs)
          s_new.extend(b)
          if is_terminal(s_new):
            print(''.join(s_new))
            n += 1
          else:
            # print("push: " + ''.join(s_new))
            heapq.heappush(q, (len(s_new), s_new))
        break # only generate leftmost derivations
      a.append(sym)

run()

Uncomment the extra print()s to see heap activity. Some example output:

1
(1)
1*1
1/1
1+1
1-1
((1))
(1*1)
(1/1)
(1)*1
(1)+1
(1)-1
(1)/1
(1+1)
(1-1)
1*(1)
1*1*1
1*1/1
1+(1)
1+1*1
1+1/1
1+1+1
1+1-1
1-(1)
1-1*1
1-1/1
1-1+1
1-1-1
1/(1)
1/1*1
1/1/1
1*1+1
1*1-1
1/1+1
1/1-1
(((1)))
((1*1))
((1/1))
((1))*1
((1))+1
((1))-1
((1))/1
((1)*1)
((1)+1)
((1)-1)
((1)/1)
((1+1))
((1-1))
(1)*(1)
(1)*1*1
(1)*1/1
(1)+(1)
(1)+1*1
Gene
  • Do you have more detail about this technique, like a paper? Does it have a name? – ostrebler Jan 24 '20 at 01:23
  • No. But I've implemented it in the past for CFGs to generate test data. It works fine. If you really need code, I can probably hack up an example. – Gene Jan 24 '20 at 01:31
  • @Strebler Okay I added a quick-and-dirty implementation. Lots of fun! – Gene Jan 24 '20 at 02:16
  • That does a nice job of generating *all* sentences, but it might take a while before it generates an interesting one of medium length (which is kind of the inverse problem that the OP had, where the sample is dominated by long sentences). See [this answer](https://stackoverflow.com/a/58223301/1566221) for a reference to a paper which provides an algorithm for producing random sentences from an (epsilon-free) CFG. – rici Jan 24 '20 at 02:20
  • @rici Sure. No warranty expressed or implied. The ascending length behavior was perfect for the testing application I used it for. – Gene Jan 24 '20 at 02:24
  • @rici I've also found [this article](https://eli.thegreenplace.net/2010/01/28/generating-random-sentences-from-a-context-free-grammar) – ostrebler Jan 24 '20 at 02:27
  • @Strebler: Yeah, the algo in the paper I referenced is somewhat similar but it does the work of computing more accurate weightings, so that you really get a uniform sample (of a given size). In Python, it's very easy to implement because you don't have to worry about arithmetic overflow. – rici Jan 24 '20 at 02:31
  • Oh, and here's an [old answer of mine](https://stackoverflow.com/a/17392943/1566221) which I think is pretty similar to your (Gene's) algo, although it doesn't come accompanied with a real implementation. (The algo at the end, not the one the answer starts with.) – rici Jan 24 '20 at 02:39
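
For reference, a rough sketch of the counting idea from the last comments (not necessarily the referenced paper's exact algorithm): tabulate how many token strings of each length every symbol can derive, then weight every random choice by those counts, so that each sentence of the requested length comes out equally likely. It assumes the grammar dict g from the answer is in scope, with each token counting as length 1.

import random
from functools import lru_cache

def make_counters(g):
  # count(sym, n): number of token strings of length n derivable from sym.
  @lru_cache(maxsize=None)
  def count(sym, n):
    if sym not in g:                     # terminals are single tokens
      return 1 if n == 1 else 0
    return sum(count_seq(tuple(rhs), n) for rhs in g[sym])

  # count_seq(rhs, n): same, for a sequence of symbols.
  @lru_cache(maxsize=None)
  def count_seq(rhs, n):
    if not rhs:
      return 1 if n == 0 else 0
    return sum(count(rhs[0], k) * count_seq(rhs[1:], n - k)
               for k in range(1, n + 1))

  return count, count_seq

def sample(g, sym, n, count, count_seq):
  # Assumes count(sym, n) > 0 for the requested n.
  if sym not in g:
    return [sym]
  # Pick a production weighted by how many length-n strings it derives.
  rhs = random.choices(g[sym], [count_seq(tuple(r), n) for r in g[sym]])[0]
  out, rest = [], n
  for i, s in enumerate(rhs):
    tail = tuple(rhs[i + 1:])
    # Pick how many tokens s consumes, weighted by the remaining counts.
    ks = list(range(1, rest + 1))
    ws = [count(s, k) * count_seq(tail, rest - k) for k in ks]
    k = random.choices(ks, ws)[0]
    out += sample(g, s, k, count, count_seq)
    rest -= k
  return out

# e.g. count, count_seq = make_counters(g)
#      print(' '.join(sample(g, '<expr>', 9, count, count_seq)))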