2

Goal: find a way to formally define a grammar that recognizes elements from a set 0 or 1 times in any order. Subsequently, I want to parse it and generate an AST as well.

For example: Say the set of valid strings in my language is {A, B, C}. I want to define a grammar that recognizes all valid permutations of any number of those elements.

Syntactically valid strings would include:

  • (the empty string)
  • A,
  • B A, and
  • C A B

Syntactically invalid strings would include:

  • A A, and
  • B A C B

To be clear, defining all possible permutations explicitly in a CFG is unacceptable for my purposes, since larger sets would be impossible to maintain.

From what I understand, such a language fails the pumping lemma for context free languages, so the solution will not be context free or regular.


Update

What I'm after is called a "permutation language", which Benedek Nagy has done some theoretical work on as an extension to context free languages.

Regarding a parser generator, I've only found talk of implementing parsers with a permutation phase (link). Parsers evidently have an exponential lower bound on the size of resulting CFG, and I haven't found any parser generators that support it anyhow.

A sort-of solution to this problem was written in ANTLR. It uses semantic predicates to 'code around' the issue.

Community
  • 1
  • 1
drfloob
  • 3,154
  • 1
  • 24
  • 31
  • The language is finite (though it could be pretty big), so it is definitely regular (and consequently context-free) because all finite languages are regular. (Trivial proof: the language is defined as the conjunction of all possible strings `w1|w2|w3|w4...|wlast`, which is obviously a regular expression.) That fact is not of much use to you, but it's still a fact. – rici Aug 27 '13 at 01:52
  • Also, the ANTLR solution is not a hack, IMHO. It (or its equivalent for another parser generator which allows predicates) is probably your best bet. In your case, you don't need the validation, so it's even simpler. – rici Aug 27 '13 at 01:55

1 Answers1

1

Assuming that the set of alternative strings is fixed and known in advance, say of size n, one can come up with a (non context-free) grammar of size O(n!). This is not asymptotically smaller than enumerating all permutations, so I suppose it cannot be considered a good solution. I believe that this grammar can be reformulated as a context-sensitive grammar (although in the form I'm suggesting below it is not).

For the example {a, b, c} mentioned in the question, one such grammar is the following. I'm using lower case letters for terminal symbols and upper case letters for non-terminals, as is customary. S is the initial non-terminal symbol.

S ::= XabcY
XabcY ::= aXbcY | bXacY | cXabY
XabY  ::= ab | ba
XacY  ::= ac | ca
XbcY  ::= bc | cb

Non-terminals X and Y enclose the substring in the production which has not been finalized yet; this substring will eventually be replaced by a permutation of the terminals that are given between X and Y (in some arbitrary order).

nickie
  • 5,608
  • 2
  • 23
  • 37
  • I think this is a regular language with O(SUMi(C(N,i))) states in the automata, which should be much less than O(N!). The automata would have one state for each symbol as first layer, one state for each combination of symbols in the second, and so on,until the final state with all symbols. – Apalala Aug 27 '13 at 00:03
  • @Apalala: it's 2**N, so, yes, it's smaller than N!, but it's still pretty big if N is not very small. – rici Aug 27 '13 at 01:50
  • @rici 2**N possible states, or O(2**N) memory, but with linear parsing, O(N), for any input sequence. The memory requirement can be bypassed if the automata is built on the go. This seems like a homework question on language theory. The problem can be solved in O(N) algorithmically. – Apalala Aug 27 '13 at 05:07
  • Oh, as far as my experience goes, most languages one would be interested in require semantics fed into the parsing. IOW, most interesting languages are not CF. – Apalala Aug 27 '13 at 05:11
  • @Apalala: no argument from me on either count. It's abundantly clear that the problem can be solved algorithmically (with the aid of a boolean vector of "seen[token]"), but it seems more like a practical parsing question than formal language theory to me. Still, I've been out of the classrooom (from either side) for too long to really be certain. And, this language happens to be CF, even if its not convenient to parse that way. This particular problem received a certain amount of attention many years ago owing to the & operator in SGML declarations, but I haven't seen it since. – rici Aug 27 '13 at 05:35