Why such a bad performance for Moses using Europarl?

Question

I have started playing around with Moses and tried to make what I believe would be a fairly standard baseline system. I have basically followed the steps described on the website, but instead of using news-commentary I have used Europarl v7 for training, with the WMT 2006 development set and the original Europarl common test. My idea was to do something similar to Le Nagard & Koehn (2010), who obtained a BLEU score of .68 in their baseline English-to-French system.

To summarise, my workflow was more or less this:

tokenizer.perl on everything
lowercase.perl (instead of truecase)
clean-corpus-n.perl
Train IRSTLM model using only French data from Europarl v7
train-model.perl exactly as described
mert-moses.pl using WMT 2006 dev
Testing and measuring performances as described

And the resulting BLEU score is .26... This leads me to two questions:

Is this a typical BLEU score for this kind of baseline system? I realise Europarl is a pretty small corpus to train a monolingual language model on, even though this is how they do things on the Moses website.
Are there any typical pitfalls for someone just starting with SMT and/or Moses I may have fallen in? Or do researchers like Le Nagard & Koehn build their baseline systems in a way different from what is described on the Moses website, for instance using some larger, undisclosed corpus to train the language model?

score 5 · Accepted Answer · answered May 07 '15 at 09:40

Just to put things straight first: the .68 you are referring to has nothing to do with BLEU.

My idea was to do something similar to Le Nagard & Koehn (2010), who obtained a BLEU score of .68 in their baseline English-to-French system.

The article you refer to only states that 68% of the pronouns (using co-reference resolution) was translated correctly. It nowhere mentions that a .68 BLEU score was obtained. As a matter of fact, no scores were given, probably because the qualitative improvement the paper proposes cannot be measured with statistical significance (which happens a lot if you only improve on a small number of words). For this reason, the paper uses a manual evaluation of the pronouns only:

A better evaluation metric is the number of correctly translated pronouns. This requires manual inspection of the translation results.

This is where the .68 comes into play.

Now to answer your questions with respect to the .26 you got:

Is this a typical BLEU score for this kind of baseline system? I realise Europarl is a pretty small corpus to train a monolingual language model on, even though this is how they do things on the Moses website.

Yes it is. You can find the performance of WMT language pairs here http://matrix.statmt.org/

Are there any typical pitfalls for someone just starting with SMT and/or Moses I may have fallen in? Or do researchers like Le Nagard & Koehn build their baseline systems in a way different from what is described on the Moses website, for instance using some larger, undisclosed corpus to train the language model?

I assume that you trained your system correctly. With respect to the "undisclosed corpus" question: members of the academic community normally state for each experiment which data sets were used for training testing and tuning, at least in peer-reviewed publications. The only exception is the WMT task (see for example http://www.statmt.org/wmt14/translation-task.html) where privately owned corpora may be used if the system participates in the unconstrained track. But even then, people will mention that they used additional data.

Damn it, you're totally right w.r.t. Le Nagard & Koehn. I read it too quickly, and got confused by the fact that they mention BLEU but never actually give any numbers. — scozy, May 07 '15 at 13:31

Why such a bad performance for Moses using Europarl?

1 Answers1