5

I have to check if a string can be derived from a given context free that is in Chomsky normal form. I'm using C++.

There is very nice pseudocode on the Wikipedia article covering the CYK algorithm, but I can't understand it very well.

Would someone be so kind to help me out by giving me another pseudocode for CYK algorithm, or maybe explain the one in the wiki article?

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
neojb1989
  • 181
  • 1
  • 3
  • 12
  • 1
    As much as I like Wikipedia, it isn't always the most readable source. For technical information for the uninitiated, it's usually best to seek alternate sources. Have you googled other locations for CYK? – RonaldBarzell Dec 05 '12 at 17:10
  • i've done a Google search but I either turn up actual code done by someone on a level that I can't understand, or I find the algorithm for doing it by hand which I have had a had time even beginning to trandlate to code. – neojb1989 Dec 05 '12 at 17:16
  • Yeah, a lot of the links are not very readable. There's a demo if you want to get familiar with it at: http://www.diotavelli.net/people/void/demos/cky.html. Additionally, here's a series of slides that seems more readable: http://www.cs.ucdavis.edu/~rogaway/classes/120/winter12/CYK.pdf. Finally, here's a C++ implementation: http://nitishkr.wordpress.com/2011/03/29/cyk-algorithm-implementation/ – RonaldBarzell Dec 05 '12 at 17:53
  • I would like to add that this is all stuff I have found earlier but I gave it a reread since you found it too. I'm not much better off though :/ – neojb1989 Dec 06 '12 at 00:29
  • Sorry to hear that... I think I understand the CYK algorithm, the problem is that it's not necessarily easy to explain as it may need grounding in some other terminology that isn't necessarily CS. If you want to take this to chat, maybe I can help, but it's not anything suitable for an answer as it would involve a back & forth – RonaldBarzell Dec 06 '12 at 13:19
  • Actually I figured it out! After watching that demo over and over again I came up with the for loops that mimic that pattern of checking and it all came together beautifully :) – neojb1989 Dec 07 '12 at 01:18
  • Wonderful! I'm glad to hear it. – RonaldBarzell Dec 07 '12 at 13:31

1 Answers1

6

The CYK algorithm takes as input a CFG that's in Chomsky normal form. That means that every production either has the form

  • S → a, for some terminal a, or
  • S → AB, for some nonterminals A and B.

Now, imagine you have a string w and you want to see whether you can derive it from a grammar whose start symbol is S. There are two options:

  1. If w is a single character long, then the only way to parse it would be to use a production of the form S → a for some character a. So see whether any of the single-character productions would match a.
  2. If w is more than one character long, then the only way to parse it is to use a production of the form S → AB for some nonterminals A and B. That means that we need to divide the string w into two pieces x and y where A derives x and B derives y. One way to do that is to try all possible ways of splitting w into two pieces and to see if any of them work.

Notice that option (2) here ends up being a recursive parsing problem: to see whether you can derive w from S, see whether you can derive x from A and y from B.

With that insight, here's pseudocode for recursive function you can use to see whether a nonterminal S derives a string w:

bool canDerive(nonterminal S, string w) {
    return canDeriveRec(S, w, 0, w.size());
}

/* Can you derive the substring [start, end) of w from S? */
bool canDeriveRec(nonterminal S, string w, int start, int end) {
    /* Base case: Single characters */
    if (end - start == 1) {
        return whether there is a production S -> a, where a = w[start];
    }

    /* Recursive case: Try all possible splits */
    for (each production S -> AB) {
        for (int mid = start + 1; mid < end; mid++) {
            if (canDeriveRec(A, w, start, mid) &&
                canDeriveRec(B, w, mid, end)) {
                return true;
            }
        }
    }
    return false;
}

This algorithm works correctly, but if you map out the shape of the recursion you'll find that

  1. it makes a ton of redundant recursive calls, but
  2. there aren't that many different possible recursive calls.

In fact, the number of distinct possible calls is O(n2 N), where n is the length of the input string (for each possible combination of a start and end index) and N is the number of nonterminals in the grammar. These observations suggest that this algorithm would benefit either from memoization or dynamic programming, depending on which approach you happen to think is nicer.

The CYK algorithm is what you get when you take the above recursive algorithm and memoize the result, or equivalently when you convert the above recursive algorithm into a dynamic programming problem.

There are O(n2 N) possible recursive calls. For each production tried, it does O(n) work. If there are P productions, on average, for a nonterminal, this means the overall runtime is O(n3 NP), which is O(n3) for a fixed grammar.

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
  • Thanks for your great explanation, just wondering if thats a bottom up or top down design as I am bit confused? – SMH Sep 18 '17 at 21:11
  • It's a little bit of both. If you think of it as a dynamic programming algorithm, then it's a bottom-up approach that determines all possible productions that could yield each of the different substrings in increasing order of length. If you think of it as a memoization approach, then it's top-down. Many parsing algorithms have that feel of mixing and matching. – templatetypedef Sep 18 '17 at 21:21
  • Thanks for your reply, the pseudocode here https://en.wikipedia.org/wiki/CYK_algorithm#As_pseudocode is bottom up approach and I am wondering what shall I change to make it top down design? – SMH Sep 18 '17 at 21:26
  • The pseudocode I've given in my answer is a good template for how you'd implement things top-down. Just add memoization and you're done! (Also, while CYK is commonly taught as a good general parsing algorithm, Earley's algorithm is way more flexible - it doesn't need things to be in CNF - and in practice can be a lot faster. One of my TAs implemented it as part of a CFG autograding tool and we've had lots of successes with it.) – templatetypedef Sep 18 '17 at 21:41
  • Thank you, I will add memorization to your answer, and I will add a new question with my new code so it will be great if you could please check it, is that alright? – SMH Sep 18 '17 at 21:49
  • I'd prefer if you left this answer unmodified or let me take a stab at it, but feel free to post your own question! – templatetypedef Sep 18 '17 at 21:53
  • It will be great if you could please update it by adding memorization, as the new question will be almost the same – SMH Sep 18 '17 at 21:55