So I am using the BLEU score metric to compare my NMT model's performance with existing models. However, I'm wondering how many settings I have to match with those models for the comparison to be valid.
Settings like the dev set, test set, and hyperparameters I think I can match. However, the preprocessing I use differs from that of the existing models, so I'm wondering whether my model's BLEU score can still be compared with theirs. There is also the chance that existing models have hidden parameters that were not reported.
https://arxiv.org/pdf/1804.08771.pdf addresses this problem of reporting BLEU and calls for switching to SacreBLEU. But many existing models report plain BLEU, so I don't think I can use the SacreBLEU metric for my model.
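For context, this is roughly how I would compute the score if I did switch to SacreBLEU, as a minimal sketch assuming the sacrebleu Python package; the file names are just placeholders for my detokenized outputs and references:

```python
# Minimal sketch using the sacrebleu package (pip install sacrebleu).
# Assumes detokenized system output and reference files; names are hypothetical.
import sacrebleu

with open("hypotheses.detok.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]

with open("reference.detok.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the system outputs and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])

print(bleu.score)  # corpus-level BLEU as a float
print(bleu)        # full formatted result for reporting
```

My understanding is that the whole point of SacreBLEU is that it applies its own standard tokenization, so the score would not depend on my preprocessing, which is exactly the comparability issue I'm asking about.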