
I've trained OpenNMT-py models from English to German and from Italian to German on Europarl, and I got very low BLEU scores: 8.13 for English -> German and 4.79 for Italian -> German.

As I'm no expert in NNs (yet), I adopted the default configuration provided by the library. Training for 13 epochs took approximately 20 hours in each case. In both cases I used 80% of the dataset for training, 10% for validation, and 10% for testing.

Below are the commands I used to create the Italian -> German model; I used a similar sequence of commands for the other one. Can anybody give me advice on how to improve the effectiveness of my models?

# $ wc -l Europarl.de-it.de
# 1832052 Europarl.de-it.de

# First ~80% (1,465,640 lines) for training
head -n 1465640 Europarl.de-it.de > train_de-it.de
head -n 1465640 Europarl.de-it.it > train_de-it.it

# Next ~10% (183,206 lines) for validation
tail -n 366412 Europarl.de-it.de | head -n 183206 > dev_de-it.de
tail -n 366412 Europarl.de-it.it | head -n 183206 > dev_de-it.it

# Last ~10% (183,206 lines) for testing
tail -n 183206 Europarl.de-it.de > test_de-it.de
tail -n 183206 Europarl.de-it.it > test_de-it.it

perl tokenizer.perl -a -no-escape -l de < ../data/train_de-it.de > ../data/train_de-it.atok.de
perl tokenizer.perl -a -no-escape -l de < ../data/dev_de-it.de > ../data/dev_de-it.atok.de
perl tokenizer.perl -a -no-escape -l de < ../data/test_de-it.de > ../data/test_de-it.atok.de

perl tokenizer.perl -a -no-escape -l it < ../data/train_de-it.it > ../data/train_de-it.atok.it
perl tokenizer.perl -a -no-escape -l it < ../data/dev_de-it.it > ../data/dev_de-it.atok.it
perl tokenizer.perl -a -no-escape -l it < ../data/test_de-it.it > ../data/test_de-it.atok.it

python3 preprocess.py \
-train_src ../data/train_de-it.atok.it \
-train_tgt ../data/train_de-it.atok.de \
-valid_src ../data/dev_de-it.atok.it \
-valid_tgt ../data/dev_de-it.atok.de \
-save_data ../data/europarl_de_it.atok.low \
-lower

python3 train.py \
-data ../data/europarl_de_it.atok.low.train.pt \
-save_model ../models_en_de/europarl_it_de_models \
-gpus 0
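
For completeness, here is a minimal sketch of how such BLEU scores can be computed, assuming OpenNMT-py's translate.py and the multi-bleu.perl script from Moses (the checkpoint file name is hypothetical; train.py appends accuracy/perplexity/epoch information to the -save_model prefix):

python3 translate.py \
-model ../models_en_de/europarl_it_de_models_acc_XX.XX_ppl_XX.XX_e13.pt \
-src ../data/test_de-it.atok.it \
-output ../data/test_de-it.pred.de \
-gpu 0

# multi-bleu.perl is case-sensitive; -lc computes lowercased BLEU,
# which matches a model trained on data preprocessed with -lower
perl multi-bleu.perl -lc ../data/test_de-it.atok.de < ../data/test_de-it.pred.de
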
  • You can get a lot of hints at [Training Romance Multi-Way model](http://forum.opennmt.net/t/training-romance-multi-way-model/86) and also [Training English-German WMT15 NMT engine](http://forum.opennmt.net/t/training-english-german-wmt15-nmt-engine/29/33). The main idea is to run BPE tokenization on a concatenated ENDE training corpus and then tokenize the training corpora with the learned BPE models. `-case_feature` is also a good idea for all languages where letters can have different case. – Wiktor Stribiżew Jul 28 '17 at 07:05
  • That's quite some interesting material to read! Thanks for your comment and for the pointers. Please make an answer out of it so that I can upvote it and mark it as the solution. – Alberto Jul 28 '17 at 07:53
  • If you come up with a step by step code, please also feel free to post as a separate answer. Or I will edit my answer after my tests are finished. – Wiktor Stribiżew Jul 28 '17 at 08:02
  • Thanks! It'll take me a while to dig into the technical details of your pointers and run everything. I'll wait for your update to accept your answer. Thanks a lot, again :) – Alberto Jul 28 '17 at 09:20

1 Answer

You can get a lot of hints at [Training Romance Multi-Way model](http://forum.opennmt.net/t/training-romance-multi-way-model/86) and also [Training English-German WMT15 NMT engine](http://forum.opennmt.net/t/training-english-german-wmt15-nmt-engine/29/33). The main idea is to run BPE tokenization on a training corpus that concatenates your source and target languages, and then tokenize the training corpora with the learned BPE model.

Byte Pair Encoding (BPE) tokenization should be especially beneficial for German because of its compounding: the algorithm helps to segment words into subword units. The trick is that you need to train the BPE model on a single training corpus containing both source and target. See Jean Senellart's comment:

The BPE model should be trained on the training corpus only - and ideally, you train one single model for source and target so that the model easily learns to translate identical word fragments from source to target. So I would concatenate the source and target training corpora, tokenize the result once, then learn a BPE model on this single corpus, which you then use for tokenizing the test/valid/train corpora on both the source and the target side.
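
As a rough sketch of that workflow using the subword-nmt scripts (learn_bpe.py and apply_bpe.py from Rico Sennrich's subword-nmt; the 32,000 merge operations and the file names are my own choices, not something the forum threads prescribe):

# Learn one BPE model on the concatenated, tokenized source+target training data
cat ../data/train_de-it.atok.it ../data/train_de-it.atok.de > ../data/train_de-it.atok.both
python3 learn_bpe.py -s 32000 < ../data/train_de-it.atok.both > ../data/bpe_codes.it-de

# Apply the learned codes to train/dev/test on both sides before running preprocess.py
for f in train dev test; do
  for l in it de; do
    python3 apply_bpe.py -c ../data/bpe_codes.it-de \
      < ../data/${f}_de-it.atok.${l} > ../data/${f}_de-it.atok.bpe.${l}
  done
done

After that, point preprocess.py at the *.bpe.* files instead of the *.atok.* ones and retrain.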

Another idea is to tokenize with -case_feature, which is recommended for all languages where letters can have different cases. See Jean's comment:

In general, using -case_feature is a good idea for almost all languages (with case) - and it performs well at handling case variation in the source and rendering it in the target (for instance all uppercase/lowercase, or capitalized words, ...).
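
Note that -case_feature is an option of OpenNMT's own tokenizer (tools/tokenize.lua in the Lua distribution), not of the Moses tokenizer.perl used in the question, so the tokenization step would change accordingly. A minimal sketch of the call (file names arbitrary; verify that your OpenNMT-py version consumes the resulting word features the same way the Lua version does):

# Lowercases each token and attaches a case feature (lowercase/uppercase/capitalized/mixed) to it
th tools/tokenize.lua -case_feature < ../data/train_de-it.de > ../data/train_de-it.case.de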

To improve MT quality, you might also try:

  1. Getting more corpora (e.g. the WMT16 corpora)
  2. Tuning with in-domain training data (see the sketch below)
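
For point 2, one common approach is to continue training the converged model on a smaller in-domain corpus that has been tokenized and preprocessed in exactly the same way. A sketch, assuming your OpenNMT-py version supports resuming from a checkpoint via -train_from (the checkpoint and data file names are hypothetical):

python3 train.py \
-data ../data/indomain_de_it.atok.low.train.pt \
-save_model ../models_en_de/europarl_it_de_indomain \
-train_from ../models_en_de/europarl_it_de_models_acc_XX.XX_ppl_XX.XX_e13.pt \
-gpus 0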