2

I'm trying to implement the CYK pseudo code provided by wikipedia. The example sentence I input should be outputting True, however it is outputting false. I think I'm getting issues on the indexing considering the provided example starts from 1.

Code:

def is_in_language(self, tokens):
    n = len(tokens)
    rules = self.grammar.lhs_to_rules
    table = defaultdict(lambda: defaultdict(dict))
    #Initialize dictionary table[row][column][nonterminal r] = boolean
    for row in range(n+1):
        for col in range(n+1):
            for r in rules:
                table[row][col][r] = False

    for i in range(n):
        nonTerminalList = self.grammar.rhs_to_rules[(tokens[i],)]
        print(nonTerminalList)
        for nonTerminal in nonTerminalList:
            (r,right,prob) = nonTerminal
            table[0][i][r] =  True


    for l in range(2,n+1):
        for s in range(n-l+1):
            for p in range(l-1+1):
                for B in rules:
                    for C in rules:
                        AList = self.grammar.rhs_to_rules[B,C]
                        if(len(AList) > 0):
                            for A in AList:
                                (leftA, rightBC, prob) = A


                                try:
                                    if(table[p][s][B] and table[l-p][s+p][C]):
                                        table[l][s][leftA] = True
                                except:
                                    pass

    print(table[n][0][self.grammar.startsymbol])
    return table
Sandipan Dey
  • 21,482
  • 2
  • 51
  • 63
user
  • 854
  • 2
  • 12
  • 28

1 Answers1

-1

The following code in python implements the CYK dynamic programming algorithm (described here), which can be used to solve the membership problem for CFG, i.e., given an input string w and a CFG grammar G in chomosky normal form (CNF) it can find whether w is in L(G) in O(n^3|w|) time.

import numpy as np
import pandas as pd

def is_in_cartesian_prod(x, y, r):
    return r in [i+j for i in x.split(',') for j in y.split(',')]

def accept_CYK(w, G, S):
    if w == 'ε':
        return 'ε' in G[S]
    n = len(w)
    DP_table = [['']*n for _ in range(n)]
    for i in range(n):
        for lhs in G.keys():
            for rhs in G[lhs]:
                 if w[i] == rhs: # rules of the form A -> a
                    DP_table[i][i] = lhs if not DP_table[i][i] else DP_table[i][i] + ',' + lhs
                    
    for l in range(2, n+1):       # span
        for i in range(n-l+1):    # start
            j = i+l-1             # right
            for k in range(i, j): # partition
                for lhs in G.keys():
                    for rhs in G[lhs]:
                        if len(rhs) == 2: #rules of form A -> BC
                            if is_in_cartesian_prod(DP_table[i][k], DP_table[k+1][j], rhs):
                                if not lhs in DP_table[i][j]:
                                    DP_table[i][j] = lhs if not DP_table[i][j] else DP_table[i][j] + ',' + lhs

    return S in DP_table[0][n-1]  

Now, let's use the above implementation of algorithm for the following simple CFG G (already in CNF):

S -> AB | BC

A -> BA | a

B -> CC | b

C -> AB | a

and the input string w = baaba to test membership of w in L(G).

# let's define the grammar productions and symbols first 
NTs = ['S', 'A', 'B', 'C', 'D']
Ts = ['a', 'b']
rules = ['S -> AB | BC', 'A -> BA | a', 'B -> CC | b', 'C -> AB | a'] #, 'D -> ϵ']
G = get_grammar(rules)
print(G)
# {'S': ['AB', 'BC'], 'A': ['BA', 'a'], 'B': ['CC', 'b'], 'C': ['AB', 'a']}

# now check if the string w is a member of G
accept_CYK('baaba', G, 'S')
# True

enter image description here

The following animation shows how the DP table gets constructed:

enter image description here

A probabilistic version of this DP algorithm with back-pointers can be used for the PCFGs to construct parse trees for natural language sentences in NLP.

Sandipan Dey
  • 21,482
  • 2
  • 51
  • 63