I have been looking for the meaning of the numbers in the giza++ phrase-table output on the official website (and in the PDF manual): http://www.statmt.org/moses/?n=FactoredTraining.ScorePhrases
This is what I have come up with.
Let's say this is a line from the phrase table:
načiniti na koji ||| way in which ||| 0.0833333 * 0.333333 * ||| * ||| 12 3 1
That means:
e = "načiniti na koji"
f = "way in which"
count(e) = 12
count(f) = 3
count(e, f) = 1
p(f|e) = count(e, f) / count(e) = 1/12 = 0.0833333
p(e|f) = count(e, f) / count(f) = 1/3 = 0.333333
This all makes perfect sense.
Yet when I run a plain text search in a text editor, I get:
count("načiniti na koji") = 4
count("way in which") = 9
i.e., totally different numbers.
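For reference, the editor search can also be reproduced at token level with a short Python sketch. It assumes a whitespace-tokenized corpus with one sentence per line; `count_phrase` and the sample sentences are hypothetical, made up only to show the call:

```python
def count_phrase(sentences, phrase):
    """Count token-level occurrences of `phrase` across tokenized sentences."""
    target = phrase.split()
    n = len(target)
    total = 0
    for sentence in sentences:
        tokens = sentence.split()
        # Slide a window of n tokens over the sentence and compare.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                total += 1
    return total


# Hypothetical sample sentences (assumption), not my real corpus.
sample = [
    "this is the way in which it works",
    "a way in which we learn , and a way in which we teach",
    "no match here",
]
print(count_phrase(sample, "way in which"))
```

Unlike a raw substring search, this only matches whole tokens, so the two can give slightly different totals on the same file.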
Another strange thing is:
osnivanje i ||| the ||| 0.000124085 * 1 * ||| 0-0 ||| 8059 1 1
So, going by the explanation on the official website,
count("the") = 1,
and
count("osnivanje i") = 8059.
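To check this reading mechanically, here is a minimal Python sketch that splits the line on `|||` and recomputes the two ratios from the last field. It is not part of Moses or giza++: the name `parse_phrase_table_line` is hypothetical, and the order of the three counts is only my assumed interpretation.

```python
def parse_phrase_table_line(line):
    """Split one phrase-table line into its '|||'-separated fields."""
    fields = [field.strip() for field in line.split("|||")]
    source, target, scores, alignment, counts = fields
    # Keep only numeric scores; '*' marks values elided in the example.
    numeric_scores = [float(s) for s in scores.split() if s != "*"]
    # Assumed order (my interpretation): two phrase counts, then the joint count.
    first, second, joint = (int(c) for c in counts.split())
    return source, target, numeric_scores, (first, second, joint)


line = "osnivanje i ||| the ||| 0.000124085 * 1 * ||| 0-0 ||| 8059 1 1"
source, target, scores, (first, second, joint) = parse_phrase_table_line(line)

# Assumed reading: each score is the joint count divided by one phrase count.
ratio_first = joint / first
ratio_second = joint / second
print(source, "|", target)
print(f"joint/first  = {ratio_first:.9f} (table: {scores[0]})")
print(f"joint/second = {ratio_second:.9f} (table: {scores[1]})")
```

For this line both ratios agree with the scores in the table to the precision shown, so the scores are at least consistent with the counts field.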
One explanation could be that the two counts are simply the other way around.
But the real count("the") is 21466.
Are there any other tutorials or manuals that explain the contents of the giza++ output files more clearly?