I'm curious whether anyone is familiar with the differences between NLTK's BLEU score calculation and the SacreBLEU library.
In particular, I'm using both libraries' sentence-level BLEU scores, averaged over the entire dataset. The averaging looks roughly like this (a sketch, using the imports shown below; `predictions` and `targets` are parallel lists of raw strings):
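>>> nltk_scores = [bleu_score.sentence_bleu([t], p) for p, t in zip(predictions, targets)]
>>> sacre_scores = [sentence_bleu(p, [t]).score for p, t in zip(predictions, targets)]
>>> nltk_avg = sum(nltk_scores) / len(nltk_scores)
>>> sacre_avg = sum(sacre_scores) / len(sacre_scores)
The two give different results. Here's a session with one example pair from my data: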
>>> from nltk.translate import bleu_score
>>> from sacrebleu import sentence_bleu
>>> import sacrebleu
>>> print(len(predictions))
256
>>> print(len(targets))
256
>>> prediction = "this is the first: the world's the world's the world's the \
... world's the world's the world's the world's the world's the world's the world \
... of the world of the world'"
>>> target = "al gore: so the alliance for climate change has launched two campaigns."
>>> print(bleu_score.sentence_bleu([target], prediction))
0.05422283394039736
>>> print(sentence_bleu(prediction, [target]).score)
0.0
>>> print(sacrebleu.corpus_bleu(predictions, [targets]).score)
0.678758518214081
>>> print(bleu_score.corpus_bleu([targets], [predictions]))
0
As you can see, these results are confusingly inconsistent. There's no way my BLEU score is really 67.8%, but it also shouldn't be exactly 0%: the predictions and targets share overlapping n-grams (like "the"), so at least the unigram precision should be nonzero.
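Even for the degenerate example pair above, there is a shared unigram, which you can check directly with a simple whitespace split:
>>> sorted(set(prediction.split()) & set(target.split()))
['the']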
I'd appreciate it if anyone could shed some light on this. Thanks.