
The output of my word alignment file looks like this:

I wish to say with regard to the initiative of the Portuguese Presidency that we support the spirit and the political intention behind it . In bezug auf die Initiative der portugiesischen Präsidentschaft möchte ich zum Ausdruck bringen , daß wir den Geist und die politische Absicht , die dahinter stehen , unterstützen .   0-0 5-1 5-2 2-3 8-4 7-5 11-6 12-7 1-8 0-9 9-10 3-11 10-12 13-13 13-14 14-15 16-16 17-17 18-18 16-19 20-20 21-21 19-22 19-23 22-24 22-25 23-26 15-27 24-28
It may not be an ideal initiative in terms of its structure but we accept Mr President-in-Office , that it is rooted in idealism and for that reason we are inclined to support it .    Von der Struktur her ist es vielleicht keine ideale Initiative , aber , Herr amtierender Ratspräsident , wir akzeptieren , daß sie auf Idealismus fußt , und sind deshalb geneigt , sie mitzutragen .   0-0 11-2 8-3 0-4 3-5 1-6 2-7 5-8 6-9 12-11 17-12 15-13 16-14 16-15 17-16 13-17 14-18 17-19 18-20 19-21 21-22 23-23 21-24 26-25 24-26 29-27 27-28 30-29 31-30 33-31 32-32 34-33

How can I produce the phrase tables that Moses uses from this output?

This PDF explains consistent phrase extraction (slides 16-21): http://www.inf.ed.ac.uk/teaching/courses/mt/lectures/phrase-model.pdf. But what is the algorithm for actually extracting the phrases?

alvas
  • I've tried iterating over all possible cell sizes in all possible combinations, but that gives me `n! * m! * n * m` cells to check for every sentence, where n and m are the lengths of the source and target sentences. – alvas Jul 26 '14 at 11:16
  • I don't understand your question. Are you trying to get the alignment itself? How does your alignment work? – Daniel Jul 27 '14 at 08:27
  • @Daniel, word alignment != phrase table. I've found the algorithm but it's not working somehow, http://stackoverflow.com/questions/25109001/phrase-extraction-algorithm-for-statistical-machine-translation – alvas Aug 03 '14 at 21:07
  • What do you mean by "not working somehow"? You implemented the algorithm below in the response, and it is giving wrong answers? – Daniel Aug 04 '14 at 03:09
  • yes, it's not giving the right output... – alvas Aug 04 '14 at 06:26
  • well, it seems like the alignment below is just an approximation, and not guaranteed to give consistent results. – Daniel Aug 04 '14 at 10:00
  • Is this a standard input format? Looks pretty ad-hoc and hard to use. – tripleee Aug 09 '14 at 17:03
  • yes, it's the pharaoh output format. One could also prefer the giza output format though, e.g. http://rali.iro.umontreal.ca/rali/?q=en/node/1325#ali. – alvas Aug 09 '14 at 17:11

1 Answer


To build a phrase table, first extract the phrase pairs with the following algorithm from Philipp Koehn's Statistical Machine Translation book, p. 133:

[Figure: pseudocode of the phrase extraction algorithm]
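
A minimal Python sketch of consistent phrase extraction, written in the spirit of that algorithm rather than copied from the book. It assumes each input line holds three tab-separated fields (source sentence, target sentence, `src-tgt` alignment links with 0-based indices), as the sample in the question appears to. The names `parse_line` and `extract_phrases` and the `max_len` cap on phrase length are illustrative choices, not part of Moses.

```python
from typing import List, Set, Tuple

Alignment = Set[Tuple[int, int]]


def parse_line(line: str) -> Tuple[List[str], List[str], Alignment]:
    """Split one line into source tokens, target tokens, and alignment links.

    Assumption: the three fields are separated by tab characters.
    """
    src, tgt, links = (field.strip() for field in line.rstrip("\n").split("\t"))
    alignment: Alignment = set()
    for link in links.split():
        s, t = link.split("-")
        alignment.add((int(s), int(t)))
    return src.split(), tgt.split(), alignment


def extract_phrases(src: List[str], tgt: List[str], alignment: Alignment,
                    max_len: int = 7) -> Set[Tuple[str, str]]:
    """Return all phrase pairs that are consistent with the word alignment."""
    tgt_aligned = {t for _, t in alignment}
    phrases: Set[Tuple[str, str]] = set()
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Minimal target span covering every link that leaves the source span.
            linked = [t for s, t in alignment if s_start <= s <= s_end]
            if not linked:
                continue  # a phrase pair needs at least one alignment point
            t_start, t_end = min(linked), max(linked)
            # Consistency check: no target word inside the span may be
            # linked to a source word outside the source span.
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for s, t in alignment):
                continue
            # Grow the target span over neighbouring unaligned words.
            ts = t_start
            while True:
                te = t_end
                while True:
                    if te - ts < max_len:
                        phrases.add((" ".join(src[s_start:s_end + 1]),
                                     " ".join(tgt[ts:te + 1])))
                    te += 1
                    if te >= len(tgt) or te in tgt_aligned:
                        break
                ts -= 1
                if ts < 0 or ts in tgt_aligned:
                    break
    return phrases


if __name__ == "__main__":
    src, tgt, alignment = parse_line("the house\tdas Haus\t0-0 1-1")
    for pair in sorted(extract_phrases(src, tgt, alignment)):
        print(pair)
```

Each source span costs one pass over the alignment links, so the work is roughly quadratic in the source length times the number of links, instead of enumerating every possible cell combination.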

Then estimate the probabilities for the phrases with their relative frequencies, i.e.

[Figure: relative-frequency formula for the phrase translation probabilities]
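
A small Python sketch of relative-frequency estimation, assuming the extracted phrase pairs from all sentences have been collected into one list of `(source, target)` strings. The probability of a target phrase given a source phrase is the count of the pair divided by the total count of that source phrase. The function name `score_phrases` and the choice of conditioning direction are illustrative assumptions; the same code covers the opposite direction if the tuples are swapped.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def score_phrases(pairs: Iterable[Tuple[str, str]]) -> Dict[Tuple[str, str], float]:
    """Estimate p(target | source) as count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    # Total number of extracted pairs that share each source phrase.
    src_counts: Counter = Counter()
    for (src, _), n in pair_counts.items():
        src_counts[src] += n
    return {(src, tgt): n / src_counts[src]
            for (src, tgt), n in pair_counts.items()}


if __name__ == "__main__":
    extracted = [("house", "Haus"), ("house", "Haus"),
                 ("house", "Gebäude"), ("the", "das")]
    for pair, prob in sorted(score_phrases(extracted).items()):
        print(pair, round(prob, 3))
```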

Note that the original printed version of the book has an error on line 4 of the extract() function; it is corrected in the errata.

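For details, see also "Phrase extraction algorithm for statistical machine translation" (http://stackoverflow.com/questions/25109001/phrase-extraction-algorithm-for-statistical-machine-translation).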

Community
alvas