Is there a fast algorithm to determine the godel number of a term of a context free language?

Question

Suppose we have a simple grammar specification. There is a way to enumerate terms of that grammar that guarantees that any finite term will have a finite position, by iterating it diagonally. For example, for the following grammar:

S      ::= add
add    ::= mul | add + mul
mul    ::= term | mul * term
term   ::= number | ( S )
number ::= digit | digit number
digit  ::= 0 | 1 | ... | 9

You can enumerate terms like that:

0
1
0+0
0*0
0+1
(0)
1+0
0*1
0+0*0
00
... etc

My question is: is there a way to do the opposite? That is, to take a valid term of that grammar, say, 0+0*0, and find its position on such enumeration - in that case, 9?

Enumerate until you hit the term? That's not fast, though, obviously. I would have posted this to CS though, SO isn't really the place for this kind of thing anymore I think. — G. Bach, May 28 '14 at 23:18
I would venture to guess that CS isn't the place for it either. It's hard for me to imagine a situation where one would care about the efficiency of an algorithm for Gödel numbering. — dfeuer, May 28 '14 at 23:28
Do you require the enumeration to be "dense", to not have gaps? It is easy to use structural recursion to make an injection into the naturals, but bijectivity seems harder. — phs, May 28 '14 at 23:33
This isn't really a duplicate, but it seems relevant: http://stackoverflow.com/questions/17387686/how-to-enumerate-the-strings-of-a-context-free-grammar/17392943#17392943 As I said in that answer, the naïve enumeration by diagonalization only works if the grammar is unambiguous. — rici, May 28 '14 at 23:43
I'm not sure whether this would help or not, but you may be able to use the [Chomsky-Schutzenberger theorem] to get a generating function for the CFG and use that to quickly determine how many strings there are that are shorter than your given string. That may then enable you to figure out which index it would occur at, though I'm not entirely sure. (http://en.wikipedia.org/wiki/Chomsky%E2%80%93Sch%C3%BCtzenberger_theorem) — templatetypedef, May 28 '14 at 23:54
I think the position is non-unique. For instance, you can generate 0+0+0 either as {0+0}+0 or 0+{0+0}, which are two positions in the output list. Of course you can just specify left or right associative and this problem goes away. — jamshidh, May 28 '14 at 23:55
By the way, it looks like the problem is closely related to parsing. I'll bet that you can go from parse tree to position in O(1), and that grammar is O(1).... I can't wrap my brain around the details of how this would be done though. — jamshidh, May 28 '14 at 23:56
@jamshidh: The given grammar is unambiguous; 0+0+0 can only be rightmost-derived as `S->add->add+mul->add+term->add+0->add+mul+0->add+term+0->add+0+0->mul+0+0->term+0+0->0+0+0`. The restriction to rightmost derivations is obviously necessary in order to enumerate unique sentences. An ambiguous grammar won't work as nicely. — rici, May 29 '14 at 00:13
@rici- My bad, you are correct.... Left associativity is built into the grammar, I didn't notice that. I now more strongly think that my second comment is correct though. When I consider the simpler case with just `number:= 0|1, sum := number | sum + number`, it looks like the position is (something like) `2^(#sum)+b`, where `#sum=` and `b=`. I still don't want to wrap my brain around the full question though, I am happy to just have convinced myself that this is probably the correct answer. :) — jamshidh, May 29 '14 at 00:37
@jamshidh: It's not quite that simple, since length matters; `0+0` and `0+0+0` are different sentences. If you don't have any empty right-hand-sides (and there is a simple algorithm to eliminate them), then you can indeed enumerate in length order, and consequently you can derive a unique number from the parse tree (more easily if you don't require the numbering to be compact). — rici, May 29 '14 at 00:42
@rici- Wouldn't any elimination algorithm potentially change the ordering of the generated expressions? At any rate, am I correct in assuming that we are on the same page whenever we don't have empty right hand sides (ie- an O(1) algorithm exists)? — jamshidh, May 29 '14 at 01:09
@jamshidh: By elimination algorithm, I meant an algorithm to modify the grammar so that it doesn't have any empty RHS. You'd have to use the modified grammar for both generation and indexing, yes. But you cannot parse an arbitrary sentence in O(1); for unambiguous LR-parseable grammars, it's `O(n)` in the length of the sentence, and in general it's `O(n^3)`. — rici, May 29 '14 at 01:49
@rici- I got what you meant by elimination algorithm, and of course I was putting the exponent (not term) in my O, just read O(1)=O(n^1) everywhere above. Certain algorithms (ie- the Earley algorithm) guarantees at worst O(n^3), but I've always been able to rewrite context free grammars as O(n).... (not sure if this is always possible though, but certainly seems to be so for any practical grammar). Of course now that I write this, I realize that my rewriting changes the order of generated senetences much like your rewrite above. :) — jamshidh, May 29 '14 at 02:22
This is exactly the problem that serialization libraries solve: how do we compactly represent a given ADT (which you can see as a grammar) as bits, and recover the ADT efficiently from those bits? Normally, we do not expect that *every* bit string correspond to some value; that makes the problem much harder. But you might be interested in the paper [Every Bit Counts](http://www.msr-waypoint.com/en-us/um/people/akenn/fun/jfpcoding.pdf). — Daniel Wagner, May 29 '14 at 02:39

score 12 · Accepted Answer · answered May 29 '14 at 04:09

For this specific problem, we can cook up something fairly simple, if we allow ourselves to choose a different enumeration ordering. The idea is basically the one in Every Bit Counts, which I also mentioned in the comments. First, some preliminaries: some imports/extensions, a data type representing the grammar, and a pretty-printer. For the sake of simplicity, my digits only go up to 2 (big enough to not be binary any more, but small enough not to wear out my fingers and your eyes).

{-# LANGUAGE TypeSynonymInstances #-}
import Control.Applicative
import Data.Universe.Helpers

type S      = Add
data Add    = Mul    Mul    | Add :+ Mul       deriving (Eq, Ord, Show, Read)
data Mul    = Term   Term   | Mul :* Term      deriving (Eq, Ord, Show, Read)
data Term   = Number Number | Parentheses S    deriving (Eq, Ord, Show, Read)
data Number = Digit  Digit  | Digit ::: Number deriving (Eq, Ord, Show, Read)
data Digit  = D0 | D1 | D2                     deriving (Eq, Ord, Show, Read, Bounded, Enum)

class PP a where pp :: a -> String
instance PP Add where
    pp (Mul m) = pp m
    pp (a :+ m) = pp a ++ "+" ++ pp m
instance PP Mul where
    pp (Term t) = pp t
    pp (m :* t) = pp m ++ "*" ++ pp t
instance PP Term where
    pp (Number n) = pp n
    pp (Parentheses s) = "(" ++ pp s ++ ")"
instance PP Number where
    pp (Digit d) = pp d
    pp (d ::: n) = pp d ++ pp n
instance PP Digit where pp = show . fromEnum

Now let's define the enumeration order. We'll use two basic combinators, +++ for interleaving two lists (mnemonic: the middle character is a sum, so we're taking elements from either the first argument or the second) and +*+ for the diagonalization (mnemonic: the middle character is a product, so we're taking elements from both the first and second arguments). More information on these in the universe documentation. One invariant we'll maintain is that our lists -- with the exception of digits -- are always infinite. This will be important later.

ss    = adds
adds  = (Mul    <$> muls   ) +++ (uncurry (:+) <$> adds +*+ muls)
muls  = (Term   <$> terms  ) +++ (uncurry (:*) <$> muls +*+ terms)
terms = (Number <$> numbers) +++ (Parentheses <$> ss)
numbers = (Digit <$> digits) ++ interleave [[d ::: n | n <- numbers] | d <- digits]
digits  = [D0, D1, D2]

Let's see a few terms:

*Main> mapM_ (putStrLn . pp) (take 15 ss)
0
0+0
0*0
0+0*0
(0)
0+0+0
0*(0)
0+(0)
1
0+0+0*0
0*0*0
0*0+0
(0+0)
0+0*(0)
0*1

Okay, now let's get to the good bit. Let's assume we have two infinite lists a and b. There's two things to notice. First, in a +++ b, all the even indices come from a, and all the odd indices come from b. So we can look at the last bit of an index to see which list to look in, and the remaining bits to pick an index in that list. Second, in a +*+ b, we can use the standard bijection between pairs of numbers and single numbers to translate between indices in the big list and pairs of indices in the a and b lists. Nice! Let's get to it. We'll define a class for Godel-able things that can be translated back and forth between numbers -- indices into the infinite list of inhabitants. Later we'll check that this translation matches the enumeration we defined above.

type Nat = Integer -- bear with me here
class Godel a where
    to :: a -> Nat
    from :: Nat -> a

instance Godel Nat where to = id; from = id

instance (Godel a, Godel b) => Godel (a, b) where
    to (m_, n_) = (m + n) * (m + n + 1) `quot` 2 + m where
        m = to m_
        n = to n_
    from p = (from m, from n) where
        isqrt    = floor . sqrt . fromIntegral
        base     = (isqrt (1 + 8 * p) - 1) `quot` 2
        triangle = base * (base + 1) `quot` 2
        m = p - triangle
        n = base - m

The instance for pairs here is the standard Cantor diagonal. It's just a bit of algebra: use the triangle numbers to figure out where you're going/coming from. Now building up instances for this class is a breeze. Numbers are just represented in base 3:

-- this instance is a lie! there aren't infinitely many Digits
-- but we'll be careful about how we use it
instance Godel Digit where
    to = fromIntegral . fromEnum
    from = toEnum . fromIntegral

instance Godel Number where
    to (Digit d) = to d
    to (d ::: n) = 3 + to d + 3 * to n
    from n
        | n < 3     = Digit (from n)
        | otherwise = let (q, r) = quotRem (n-3) 3 in from r ::: from q

For the remaining three types, we will, as suggested above, check the tag bit to decide which constructor to emit, and use the remaining bits as indices into a diagonalized list. All three instances necessarily look very similar.

instance Godel Term where
    to (Number n) = 2 * to n
    to (Parentheses s) = 1 + 2 * to s
    from n = case quotRem n 2 of
        (q, 0) -> Number (from q)
        (q, 1) -> Parentheses (from q)

instance Godel Mul where
    to (Term t) = 2 * to t
    to (m :* t) = 1 + 2 * to (m, t)
    from n = case quotRem n 2 of
        (q, 0) -> Term (from q)
        (q, 1) -> uncurry (:*) (from q)

instance Godel Add where
    to (Mul m) = 2 * to m
    to (m :+ t) = 1 + 2 * to (m, t)
    from n = case quotRem n 2 of
        (q, 0) -> Mul (from q)
        (q, 1) -> uncurry (:+) (from q)

And that's it! We can now "efficiently" translate back and forth between parse trees and their Godel numbering for this grammar. Moreover, this translation matches the above enumeration, as you can verify:

*Main> map from [0..29] == take 30 ss
True

We did abuse many nice properties of this particular grammar -- non-ambiguity, the fact that almost all the nonterminals had infinitely many derivations -- but variations on this technique can get you quite far, especially if you are not too strict on requiring every number to be associated with something unique.

Also, by the way, you might notice that, except for the instance for (Nat, Nat), these Godel numberings are particularly nice in that they look at/produce one bit (or trit) at a time. So you could imagine doing some streaming. But the (Nat, Nat) one is pretty nasty: you have to know the whole number ahead of time to compute the sqrt. You actually can turn this into a streaming guy, too, without losing the property of being dense (every Nat being associated with a unique (Nat, Nat)), but that's a topic for another answer...

I hope all the karma you get from my questions suffices to pay for everything you teach me, or else I'm in a big debit. — MaiaVictor, Jun 07 '14 at 16:49
@Viclib I'm thrilled you're learning! That's payback enough for me. — Daniel Wagner, Jun 07 '14 at 17:21

Is there a fast algorithm to determine the godel number of a term of a context free language?

1 Answers1

Linked