I am using tst2013.en (found here) as my test set to compute a test BLEU
score for comparison with previously published models. However, I have to filter out sentences longer than 100 words; otherwise I won't have the resources to run the model.
Given this slightly modified test set, is it acceptable to compare my test BLEU
score with those of models evaluated on the unmodified test set?
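
For concreteness, here is a minimal sketch of the kind of filtering described above. It assumes "words" means whitespace-separated tokens and that the reference translations are filtered in lockstep with the source sentences so the two stay aligned. The function name `filter_long_pairs` and the toy data are hypothetical, not part of any library.

```python
def filter_long_pairs(src_lines, ref_lines, max_words=100):
    """Drop sentence pairs whose source side has more than max_words tokens.

    Tokens are counted by whitespace splitting (an assumption; a real
    pipeline may count subword units instead).
    """
    if len(src_lines) != len(ref_lines):
        raise ValueError("source and reference must have the same number of lines")

    kept_src, kept_ref = [], []
    for src, ref in zip(src_lines, ref_lines):
        # Keep the pair only if the source sentence is within the limit.
        if len(src.split()) <= max_words:
            kept_src.append(src)
            kept_ref.append(ref)

    dropped = len(src_lines) - len(kept_src)
    return kept_src, kept_ref, dropped


# Toy data standing in for tst2013.en and its reference file.
src = ["a short sentence", " ".join(["word"] * 101), "another one"]
ref = ["ref 0", "ref 1", "ref 2"]

kept_src, kept_ref, dropped = filter_long_pairs(src, ref)
print(f"kept {len(kept_src)} of {len(src)} pairs, dropped {dropped}")
```

Counting the dropped pairs makes it possible to state exactly how much of the test set was removed when reporting the score.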