from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'ae', 'test']]
candidate = ['this', 'is', 'ad', 'test']
score = sentence_bleu(reference, candidate)
print(score)

I am using this code to calculate the BLEU score, and the score I am getting is 1.0547686614863434e-154. I wonder why I am getting such a small value when only one letter is different in the candidate list.

score = sentence_bleu(reference, candidate, weights=[1])

I tried adding weights=[1] as a parameter and it gave me 0.75 as output. I can't understand why I have to add weights to get a reasonable result. Any help would be appreciated.

I thought it might be because the sentence is not long enough, so I added more words:

from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'ae', 'test', 'rest', 'pep', 'did']]
candidate = ['this', 'is', 'ad', 'test', 'rest', 'pep', 'did']
score = sentence_bleu(reference, candidate)
print(score)

Now I am getting 0.488923022434901, but I still think that value is too low.

2 Answers


By default, sentence_bleu is configured with 4 weights: 0.25 for unigrams, 0.25 for bigrams, 0.25 for trigrams, and 0.25 for 4-grams. The length of the weights list gives the maximum n-gram order, so the BLEU score is computed over 4 levels of n-grams.

When you use weights=[1], you only analyze unigrams:

reference = [['this', 'is', 'ae', 'test', 'rest', 'pep', 'did']]
candidate = ['this', 'is', 'ad', 'test', 'rest', 'pep', 'did']

>>> sentence_bleu(reference, candidate)  # default weights, order of ngrams=4
0.488923022434901
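
For example, with the same reference and candidate, the unigram-only call looks like the sketch below. Since 6 of the 7 candidate words match and the two sentences have the same length (so no brevity penalty), the score should come out around 6/7 ≈ 0.857:

from nltk.translate.bleu_score import sentence_bleu

reference = [['this', 'is', 'ae', 'test', 'rest', 'pep', 'did']]
candidate = ['this', 'is', 'ad', 'test', 'rest', 'pep', 'did']

# weights=[1]: only unigram precision counts; there is no brevity penalty
# because the candidate and reference have the same length, so this should
# print roughly 6/7 ≈ 0.857
print(sentence_bleu(reference, candidate, weights=[1]))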

But you can also decide that unigrams are more important than bigrams, which are in turn more important than trigrams and 4-grams:

>>> sentence_bleu(reference, candidate, weights=[0.5, 0.3, 0.1, 0.1])
0.6511772622175621

You can also use the SmoothingFunction methods; read the docstring in the source code for a better understanding.

Corralien

BLEU compares word n-grams, not characters. If you are comparing two 4-word n-grams and even a single character differs, they don't match. So your first test has no matching 3-grams or 4-grams, and BLEU is reporting essentially zero similarity (the 1e-154 is just floating-point noise). The reason is explained in help(sentence_bleu):

...If there is no ngrams overlap for any order of n-grams, BLEU returns the value 0. This is because the precision for the order of n-grams without overlap is 0, and the geometric mean in the final BLEU score computation multiplies the 0 with the precision of other n-grams. This results in 0 (independently of the precision of the other n-gram orders).

So in your first example the 3-gram and 4-gram precisions are zero, which pushes the final score to (effectively) zero. And the second score is "low" because most of the n-grams being compared are not matches.
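
If you want to see those per-order precisions for yourself, nltk.translate.bleu_score also exposes modified_precision. Here is a quick sketch for your 7-word example; the fractions in the comment are what I get counting by hand, so double-check them against your output:

from nltk.translate.bleu_score import modified_precision

reference = [['this', 'is', 'ae', 'test', 'rest', 'pep', 'did']]
candidate = ['this', 'is', 'ad', 'test', 'rest', 'pep', 'did']

for n in range(1, 5):
    # clipped count of matching n-grams / number of candidate n-grams
    p_n = modified_precision(reference, candidate, n)
    print(n, p_n)

# Counted by hand this should give 6/7, 4/6, 2/5 and 1/4; the geometric
# mean (6/7 * 4/6 * 2/5 * 1/4) ** 0.25 is about 0.489, i.e. the default
# sentence_bleu result you saw.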

That answers your question, which was "why". The rest of the help text suggests using a smoothing function to avoid this "harsh" behaviour:

To avoid this harsh behaviour when no ngram overlaps are found a smoothing function can be used.

>>> chencherry = SmoothingFunction()
>>> sentence_bleu([reference1, reference2, reference3], hypothesis2,
...     smoothing_function=chencherry.method1)
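
Applied to your original 4-word example, that looks something like the sketch below. method1 simply replaces the zero 3-gram and 4-gram counts with a small epsilon, so expect a small but nonzero score (roughly 0.19 with the default epsilon) instead of the 1e-154 artifact:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['this', 'is', 'ae', 'test']]
candidate = ['this', 'is', 'ad', 'test']

chencherry = SmoothingFunction()
# method1 adds a small epsilon to the zero-count 3-gram and 4-gram
# precisions, so the geometric mean no longer collapses to ~0
score = sentence_bleu(reference, candidate,
                      smoothing_function=chencherry.method1)
print(score)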
alexis