Phrase extraction algorithm for statistical machine translation

Question

I have written the following code with the phrase extraction algorithm for SMT.

# -*- coding: utf-8 -*-

def phrase_extraction(srctext, trgtext, alignment):
    """
    Phrase extraction algorithm. 
    """
    def extract(f_start, f_end, e_start, e_end):
        phrases = set()
        # return { } if f end == 0
        if f_end == 0:
            return
        # for all (e,f) ∈ A do
        for e,f in alignment:
            # return { } if e < e start or e > e end
            if e < e_start or e > e_end:        
                return

        fs = f_start
        # repeat-
        while True:
            fe = f_end
            # repeat-
            while True:
                # add phrase pair ( e start .. e end , f s .. f e ) to set E
                trg_phrase = " ".join(trgtext[i] for i in range(fs,fe))
                src_phrase = " ".join(srctext[i] for i in range(e_start,e_end))
                phrases.add("\t".join([src_phrase, trg_phrase]))
                fe+=1 # fe++
                # -until fe aligned
                if fe in f_aligned or fe > trglen:
                    break
            fs-=1 # fe--
            # -until fs aligned
            if fs in f_aligned or fs < 0:
                break
        return phrases

    # Calculate no. of tokens in source and target texts.
    srctext = srctext.split()
    trgtext = trgtext.split()
    srclen = len(srctext)
    trglen = len(trgtext)
    # Keeps an index of which source/target words are aligned.
    e_aligned = [i for i,_ in alignment]
    f_aligned = [j for _,j in alignment] 

    bp = set() # set of phrase pairs BP
    # for e start = 1 ... length(e) do
    for e_start in range(srclen):
        # for e end = e start ... length(e) do       
        for e_end in range(e_start, srclen):
            # // find the minimally matching foreign phrase
            # (f start , f end ) = ( length(f), 0 )
            f_start, f_end = trglen, 0
            # for all (e,f) ∈ A do
            for e,f in alignment:
                # if e start ≤ e ≤ e end then
                if e_start <= e <= e_end:
                    f_start = min(f, f_start)
                    f_end = max(f, f_end)
            # add extract (f start , f end , e start , e end ) to set BP
            phrases = extract(f_start, f_end, e_start, e_end)
            if phrases:
                bp.update(phrases)
    return bp

srctext = "michael assumes that he will stay in the house"
trgtext = "michael geht davon aus , dass er im haus bleibt"
alignment = [(0,0), (1,1), (1,2), (1,3), (2,5), (3,6), (4,9), (5,9), (6,7), (7,7), (8,8)]

phrases = phrase_extraction(srctext, trgtext, alignment)

for i in phrases:
    print i

The phrase extraction algorithm from Philip Koehn's Statistical Machine Translation book, page 133 is as such:

enter image description here

And the desired output should be:

enter image description here

However with my code, I am only able to get these output:

michael assumes that he will stay in the - michael geht davon aus , dass er im haus

michael assumes that he will stay in the - michael geht davon aus , dass er im haus bleibt

Does anyone spot what is wrong with my implementation?

The code does extract phrases but it's not the complete desired output as shown with the translation table above:

enter image description here

Why downvote? it was on codereview and i've no idea why it came here... — alvas, Aug 03 '14 at 21:00
@jamal, i'm pretty sure the code is working to get the output... — alvas, Aug 03 '14 at 21:02
It was migrated from CR as non-working code is off-topic there. — jonrsharpe, Aug 03 '14 at 21:11
it's a working code, i'm pretty sure, i can post a printscreen of my console output... — alvas, Aug 03 '14 at 21:11
Then why do you ask *"what is wrong with my implementation"*? **Running** is not the same as **working** - if the output is wrong this is non-working code. — jonrsharpe, Aug 03 '14 at 21:12
it's not getting the correct output, although the implementation of the algorithm looks right. — alvas, Aug 03 '14 at 21:13
@alvas if it's not getting the correct output, it's clearly not working. — TheSoundDefense, Aug 03 '14 at 21:19
It may look right, but if the output is wrong either the input or the algorithm is wrong, and I'd really hope you double-checked the input! — jonrsharpe, Aug 03 '14 at 21:20
So much code, so few tests... I can reproduce your results and adding a print to the inner loop of `extract` shows it runs exactly twice. Given this I wonder what is really meant by `for all (e,f) ∈ A do`? It appears twice and you translate it to a for-loop in both cases. How is this motivated? — FredrikHedman, Aug 03 '14 at 21:47
@FredrikHedman `for all (e,f) ∈ A do` refers to iterating through all possible alignment points, in this case, it's the input list of tuples — alvas, Aug 03 '14 at 21:49
A trace of the example as it runs shows that `extract` is called 45 times and finds two phrases. Most of these calls never get to the point where phrases are generated. This points to the parameters being wrong (too few) or the logic of early return being wrong. I also notice the algorithm uses 1-based indexing and the code 0-based. Suggestion: write tests that demonstrate that the different parts work as expectd... — FredrikHedman, Aug 03 '14 at 23:05
Not good to ask one Q twice : http://stackoverflow.com/questions/24970434/how-to-get-phrase-tables-from-word-alignments?noredirect=1#comment39082169_24970434 — Daniel, Aug 04 '14 at 10:22
@Daniel, it's actually a different question, the other one ask how to get phrases from word alignments and this question ask what is wrong with the code implementation or the algorithm. — alvas, Aug 04 '14 at 11:44

score 8 · Accepted Answer · edited Feb 20 '17 at 19:18

So I have had a look at this problem and can now reproduce the desired output. Turns out there are a number of problems:

The algorithm is not quite complete. In the online version of the book (3rd printing 2012) line 4 of the extract function has been updated. (Maybe there is an Errata)
The algorithm assumes indexing starting from 1 up to and including the end.
Python assumes 0 based indexing up to but not including the end.
In particular this affects the stopping values of some calls to range and a number of comparisons.
The items in the set have been modified to allow for easier reproduction of the desired output.

Example output (matching 19 phrases with some matches repeated 5 times, in total 24 extracted):

$ python2.7 phrase_extract_new.py 
( 1) (0, 1) michael — michael
( 2) (0, 2) michael assumes — michael geht davon aus ; michael geht davon aus ,
( 3) (0, 3) michael assumes that — michael geht davon aus , dass
( 4) (0, 4) michael assumes that he — michael geht davon aus , dass er
( 5) (0, 9) michael assumes that he will stay in the house — michael geht davon aus , dass er im haus bleibt
( 6) (1, 2) assumes — geht davon aus ; geht davon aus ,
( 7) (1, 3) assumes that — geht davon aus , dass
( 8) (1, 4) assumes that he — geht davon aus , dass er
( 9) (1, 9) assumes that he will stay in the house — geht davon aus , dass er im haus bleibt
(10) (2, 3) that — dass ; , dass
(11) (2, 4) that he — dass er ; , dass er
(12) (2, 9) that he will stay in the house — dass er im haus bleibt ; , dass er im haus bleibt
(13) (3, 4) he — er
(14) (3, 9) he will stay in the house — er im haus bleibt
(15) (4, 6) will stay — bleibt
(16) (4, 9) will stay in the house — im haus bleibt
(17) (6, 8) in the — im
(18) (6, 9) in the house — im haus
(19) (8, 9) house — haus
$ python2.7 phrase_extract_new.py | grep -c ';'
5

Below follows the suggested interpretation of the algorithm. This algorithm needs to be refactored quite a bit, but before that more testing is needed with different examples. Reproducing the example in the book is just a start:

# -*- coding: utf-8 -*-

def phrase_extraction(srctext, trgtext, alignment):
    """
    Phrase extraction algorithm.
    """
    def extract(f_start, f_end, e_start, e_end):
        if f_end < 0:  # 0-based indexing.
            return {}
        # Check if alignement points are consistent.
        for e,f in alignment:
            if ((f_start <= f <= f_end) and
               (e < e_start or e > e_end)):
                return {}

        # Add phrase pairs (incl. additional unaligned f)
        # Remark:  how to interpret "additional unaligned f"?
        phrases = set()
        fs = f_start
        # repeat-
        while True:
            fe = f_end
            # repeat-
            while True:
                # add phrase pair ([e_start, e_end], [fs, fe]) to set E
                # Need to +1 in range  to include the end-point.
                src_phrase = " ".join(srctext[i] for i in range(e_start,e_end+1))
                trg_phrase = " ".join(trgtext[i] for i in range(fs,fe+1))
                # Include more data for later ordering.
                phrases.add(((e_start, e_end+1), src_phrase, trg_phrase))
                fe += 1 # fe++
                # -until fe aligned or out-of-bounds
                if fe in f_aligned or fe == trglen:
                    break
            fs -=1  # fe--
            # -until fs aligned or out-of- bounds
            if fs in f_aligned or fs < 0:
                break
        return phrases

    # Calculate no. of tokens in source and target texts.
    srctext = srctext.split()   # e
    trgtext = trgtext.split()   # f
    srclen = len(srctext)       # len(e)
    trglen = len(trgtext)       # len(f)
    # Keeps an index of which source/target words are aligned.
    e_aligned = [i for i,_ in alignment]
    f_aligned = [j for _,j in alignment]

    bp = set() # set of phrase pairs BP
    # for e start = 1 ... length(e) do
    # Index e_start from 0 to len(e) - 1
    for e_start in range(srclen):
        # for e end = e start ... length(e) do
        # Index e_end from e_start to len(e) - 1
        for e_end in range(e_start, srclen):
            # // find the minimally matching foreign phrase
            # (f start , f end ) = ( length(f), 0 )
            # f_start ∈ [0, len(f) - 1]; f_end ∈ [0, len(f) - 1]
            f_start, f_end = trglen-1 , -1  #  0-based indexing
            # for all (e,f) ∈ A do
            for e,f in alignment:
                # if e start ≤ e ≤ e end then
                if e_start <= e <= e_end:
                    f_start = min(f, f_start)
                    f_end = max(f, f_end)
            # add extract (f start , f end , e start , e end ) to set BP
            phrases = extract(f_start, f_end, e_start, e_end)
            if phrases:
                bp.update(phrases)
    return bp

# Refer to match matrix.
#             0      1      2   3  4     5   6   7    8
srctext = "michael assumes that he will stay in the house"
#             0      1    2    3  4  5   6  7   8     9
trgtext = "michael geht davon aus , dass er im haus bleibt"
alignment = [(0,0), (1,1), (1,2), (1,3), (2,5), (3,6), (4,9), (5,9), (6,7), (7,7), (8,8)]

phrases = phrase_extraction(srctext, trgtext, alignment)

# Keep track of translations of each phrase in srctext and its
# alignement using a dictionary with keys as phrases and values being
# a list [e_alignement pair, [f_extractions, ...] ]
dlist = {}
for p, a, b in phrases:
    if a in dlist:
        dlist[a][1].append(b)
    else:
        dlist[a] = [p, [b]]

# Sort the list of translations based on their length.  Shorter phrases first.
for v in dlist.values():
    v[1].sort(key=lambda x: len(x))


# Function to help sort according to book example.
def ordering(p):
    k,v = p
    return v[0]
#
for i, p in enumerate(sorted(dlist.items(), key = ordering), 1):
    k, v = p
    print "({0:2}) {1} {2} — {3}".format( i, v[0], k, " ; ".join(v[1]))

The final part where the output is prepared can probably be improved...and the algorithm code can certaily be made more Pythonic.

Phrase extraction algorithm for statistical machine translation

1 Answers1

Linked