
I get a zero value when I calculate the BLEU score for Chinese sentences.

The candidate sentence is `c` and the two references are `r1` and `r2`:

c=[u'\u9274\u4e8e', u'\u7f8e\u56fd', u'\u96c6', u'\u7ecf\u6d4e', u'\u4e0e', u'\u8d38\u6613', u'\u6700\u5927', u'\u56fd\u4e8e', u'\u4e00\u8eab', u'\uff0c', u'\u4e0a\u8ff0', u'\u56e0\u7d20', u'\u76f4\u63a5', u'\u5f71\u54cd', u'\u7740', u'\u4e16\u754c', u'\u8d38\u6613', u'\u3002']

r1 = [u'\u8fd9\u4e9b', u'\u76f4\u63a5', u'\u5f71\u54cd', u'\u5168\u7403', u'\u8d38\u6613', u'\u548c', u'\u7f8e\u56fd', u'\u662f', u'\u4e16\u754c', u'\u4e0a', u'\u6700\u5927', u'\u7684', u'\u5355\u4e00', u'\u7684', u'\u7ecf\u6d4e', u'\u548c', u'\u8d38\u6613\u5546', u'\u3002']

r2=[u'\u8fd9\u4e9b', u'\u76f4\u63a5', u'\u5f71\u54cd', u'\u5168\u7403', u'\u8d38\u6613', u'\uff0c', u'\u56e0\u4e3a', u'\u7f8e\u56fd', u'\u662f', u'\u4e16\u754c', u'\u4e0a', u'\u6700\u5927', u'\u7684', u'\u5355\u4e00', u'\u7684', u'\u7ecf\u6d4e\u4f53', u'\u548c', u'\u8d38\u6613\u5546', u'\u3002']

The code is:

weights = [0.1, 0.8, 0.05, 0.05]
print nltk.align.bleu_score.bleu(c, [r1, r2], weights)

But I got a result of 0. When I stepped into the bleu function, I found that

try:
    s = math.fsum(w * math.log(p_n) for w, p_n in zip(weights, p_ns))
except ValueError:
    # some p_ns is 0
    return 0

The above code falls into the except ValueError branch. However, I don't understand why this raises an error, since I get a non-zero value for other sentences.

  • Stack Overflow warned me `Your question couldn't be submitted. Please see the error above.` No matter how I changed the first sentence of my question, it always said `Body cannot contain ""`. – flyingmouse Jan 01 '16 at 14:42
  • 1
    Actually, I typed Chinese in `c` and `r1` and `r2`, but stackoverflow does not allow me to submit the Chinese characters. I can only explain here. `c` and `r1` and `r2` have common parts. The BLEU score should not be 0. – flyingmouse Jan 01 '16 at 14:44
  • Which version of NLTK are you using? Can you check with `import nltk; nltk.__version__`? – alvas Jan 01 '16 at 19:11
  • Issue raised: https://github.com/nltk/nltk/issues/1241 – alvas Jan 01 '16 at 21:03
  • Bug fixed on https://github.com/nltk/nltk/commit/7653786ad77493976a4045c1058541c166999c91 – alvas Jan 24 '16 at 04:15

1 Answer


It seems like you've caught a bug in the NLTK implementation! This try-except is wrong at https://github.com/alvations/nltk/blob/develop/nltk/translate/bleu_score.py#L76


In Long:

Firstly, let's go through what p_n in the BLEU score means:

p_n = Σ Count_clip(ngram) / Σ Count(ngram)

where both sums run over the n-grams of the hypothesis, and Count_clip(ngram) clips the count of each hypothesis n-gram at the maximum number of times that n-gram appears in any single reference.

Note that:

  • the Papineni formula is a corpus-level BLEU score, whereas this implementation computes a sentence-level BLEU score (the bleeding-edge version of NLTK contains an implementation that follows the Papineni paper and calculates corpus-level BLEU).
  • in multi-reference BLEU, Count_match(ngram) is based on the reference that gives the highest count (see https://github.com/alvations/nltk/blob/develop/nltk/translate/bleu_score.py#L270); a minimal sketch of this clipping follows right after this list.
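
To make the second point concrete, here is a minimal sketch of the clipping on toy data (illustrative only, not the NLTK code itself):

>>> from collections import Counter
>>> hyp_counts = Counter(['the', 'the', 'the', 'cat'])
>>> ref_counts = [Counter(['the', 'cat', 'sat']), Counter(['the', 'the', 'dog'])]
>>> # clip each hypothesis count at the highest count in any single reference
>>> sum(min(cnt, max(rc[ng] for rc in ref_counts)) for ng, cnt in hyp_counts.items())
3

Here 'the' is clipped from 3 down to 2 (its count in the second reference) and 'cat' stays at 1, so the unigram numerator is 3 over a denominator of 4.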

So the default BLEU score uses N=4, which includes unigrams up to 4-grams. For each n, let's calculate p_n:

>>> from collections import Counter
>>> from nltk import ngrams
>>> hyp = u"鉴于 美国 集 经济 与 贸易 最大 国于 一身 , 上述 因素 直接 影响 着 世界 贸易 。".split()
>>> ref1 = u"这些 直接 影响 全球 贸易 和 美国 是 世界 上 最大 的 单一 的 经济 和 贸易商 。".split()
>>> ref2 = u"这些 直接 影响 全球 贸易 ， 因为 美国 是 世界 上 最大 的 单一 的 经济体 和 贸易商 。".split()
# Calculate p_1, p_2, p_3 and p_4
>>> from nltk.translate.bleu_score import _modified_precision
>>> p_1 = _modified_precision([ref1, ref2], hyp, 1)
>>> p_2 = _modified_precision([ref1, ref2], hyp, 2)
>>> p_3 = _modified_precision([ref1, ref2], hyp, 3)
>>> p_4 = _modified_precision([ref1, ref2], hyp, 4)
>>> p_1, p_2, p_3, p_4
(Fraction(4, 9), Fraction(1, 17), Fraction(0, 1), Fraction(0, 1))

Note that since https://github.com/nltk/nltk/pull/1229, _modified_precision returns a Fraction instead of a float, so now we can clearly see the numerator and the denominator.

So let's now verify the unigram output of _modified_precision. The words shown in bold in the hypothesis occur in the references:

  • 鉴于 **美国** 集 **经济** 与 **贸易** **最大** 国于 一身 , 上述 因素 **直接** **影响** 着 **世界** **贸易** **。**

There are 9 overlapping tokens; one of the 8 overlapping words (贸易) occurs twice in the hypothesis.

>>> from collections import Counter
>>> ref1_unigram_counts = Counter(ngrams(ref1, 1))
>>> ref2_unigram_counts = Counter(ngrams(ref2, 1))
>>> hyp_unigram_counts = Counter(ngrams(hyp,1))
>>> for overlaps in set(hyp_unigram_counts.keys()).intersection(ref1_unigram_counts.keys()):
...     print " ".join(overlaps)
... 
美国
直接
经济
影响
。
最大
世界
贸易
>>> overlap_counts = Counter({ng:hyp_unigram_counts[ng] for ng in set(hyp_unigram_counts.keys()).intersection(ref1_unigram_counts.keys())})
>>> overlap_counts
Counter({(u'\u8d38\u6613',): 2, (u'\u7f8e\u56fd',): 1, (u'\u76f4\u63a5',): 1, (u'\u7ecf\u6d4e',): 1, (u'\u5f71\u54cd',): 1, (u'\u3002',): 1, (u'\u6700\u5927',): 1, (u'\u4e16\u754c',): 1})

Now let's check how many times these overlapping words occur in the references. The "combined" counts across the different references form the numerator of the p_1 formula: if the same word occurs in both references, we take the maximum count.

>>> overlap_counts_in_ref1 = Counter({ng:ref1_unigram_counts[ng] for ng in set(hyp_unigram_counts.keys()).intersection(ref1_unigram_counts.keys())})
>>> overlap_counts_in_ref2 = Counter({ng:ref2_unigram_counts[ng] for ng in set(hyp_unigram_counts.keys()).intersection(ref2_unigram_counts.keys())})
>>> overlap_counts_in_ref1
Counter({(u'\u7f8e\u56fd',): 1, (u'\u76f4\u63a5',): 1, (u'\u7ecf\u6d4e',): 1, (u'\u5f71\u54cd',): 1, (u'\u3002',): 1, (u'\u6700\u5927',): 1, (u'\u4e16\u754c',): 1, (u'\u8d38\u6613',): 1})
>>> overlap_counts_in_ref2
Counter({(u'\u7f8e\u56fd',): 1, (u'\u76f4\u63a5',): 1, (u'\u5f71\u54cd',): 1, (u'\u3002',): 1, (u'\u6700\u5927',): 1, (u'\u4e16\u754c',): 1, (u'\u8d38\u6613',): 1})
>>> overlap_counts_in_ref1_ref2 = Counter()
>>> numerator = overlap_counts_in_ref1_ref2
>>> 
>>> for c in [overlap_counts_in_ref1, overlap_counts_in_ref2]:
...     for k in c:
...             numerator[k] = max(numerator.get(k,0), c[k])
... 
>>> numerator
Counter({(u'\u7f8e\u56fd',): 1, (u'\u76f4\u63a5',): 1, (u'\u7ecf\u6d4e',): 1, (u'\u5f71\u54cd',): 1, (u'\u3002',): 1, (u'\u6700\u5927',): 1, (u'\u4e16\u754c',): 1, (u'\u8d38\u6613',): 1})
>>> sum(numerator.values())
8

Now for the denominator: it's simply the number of unigrams in the hypothesis:

>>> hyp_unigram_counts
Counter({(u'\u8d38\u6613',): 2, (u'\u4e0e',): 1, (u'\u7f8e\u56fd',): 1, (u'\u56fd\u4e8e',): 1, (u'\u7740',): 1, (u'\u7ecf\u6d4e',): 1, (u'\u5f71\u54cd',): 1, (u'\u56e0\u7d20',): 1, (u'\u4e16\u754c',): 1, (u'\u3002',): 1, (u'\u4e00\u8eab',): 1, (u'\u6700\u5927',): 1, (u'\u9274\u4e8e',): 1, (u'\u4e0a\u8ff0',): 1, (u'\u96c6',): 1, (u'\u76f4\u63a5',): 1, (u'\uff0c',): 1})
>>> sum(hyp_unigram_counts.values())
18

So the resulting fraction is 8/18, which reduces to 4/9, and our _modified_precision function checks out.
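
As a quick sanity check (continuing the session above), Fraction normalizes 8/18 automatically:

>>> from fractions import Fraction
>>> p_1 == Fraction(8, 18) == Fraction(4, 9)
True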

Now let's get to the full BLEU formula:

BLEU = BP * exp( Σ w_n * log p_n )

where the sum runs over n = 1..N, and BP is the brevity penalty: BP = 1 if the hypothesis length c exceeds the reference length r, and exp(1 - r/c) otherwise.

From the formula, let's consider only the exponential of the summation for now, i.e. exp(Σ w_n * log p_n). Since the exponential of a sum of logs is just a product, this is the weighted geometric mean of the various p_n we calculated previously, and NLTK implements it as the sum of the logarithms, see https://github.com/alvations/nltk/blob/develop/nltk/translate/bleu_score.py#L79
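
For instance, with two made-up precisions and weights, the exponentiated sum of weighted logs matches the weighted product of the precisions up to floating point error:

>>> from math import exp, log
>>> p, w = [0.5, 0.25], [0.6, 0.4]
>>> abs(exp(w[0]*log(p[0]) + w[1]*log(p[1])) - p[0]**w[0] * p[1]**w[1]) < 1e-12
True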

Ignoring BP and the weights for now, let's take the log of each p_n we calculated:

>>> from fractions import Fraction
>>> from math import log
>>> log(Fraction(4, 9))
-0.8109302162163288
>>> log(Fraction(1, 17))
-2.833213344056216
>>> log(Fraction(0, 1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: math domain error

Ah ha! That's where the error appears: math.log(0) raises a ValueError, so the math.fsum() call blows up and the except clause returns 0:

>>> try:
...     sum(log(pi) for pi in (Fraction(4, 9), Fraction(1, 17), Fraction(0, 1), Fraction(0, 1)))
... except ValueError:
...     0
... 
0

To correct the implementation, the try-except should have been:

s = []
# Calculate the overall modified precision for all ngrams
# by summing the products of the weights and the respective log p_n.
for w, p_n in zip(weights, p_ns):
    try:
        s.append(w * math.log(p_n))
    except ValueError:
        # some p_n is 0; contribute nothing for this order
        # instead of zeroing out the whole score
        s.append(0)
return sum(s)
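
With that fix, the zero-valued p_3 and p_4 no longer wipe out the entire score. A rough check with the p_n values and weights from the question, ignoring the brevity penalty and wrapping the loop above in a hypothetical helper:

>>> import math
>>> def weighted_log_sum(weights, p_ns):
...     # hypothetical helper mirroring the corrected loop above
...     s = []
...     for w, p_n in zip(weights, p_ns):
...         try:
...             s.append(w * math.log(p_n))
...         except ValueError:
...             s.append(0)  # skip n-gram orders with zero precision
...     return sum(s)
... 
>>> weights = [0.1, 0.8, 0.05, 0.05]
>>> round(math.exp(weighted_log_sum(weights, [4.0/9, 1.0/17, 0, 0])), 4)
0.0956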

References:

The formulas come from http://lotus.kuee.kyoto-u.ac.jp/WAT/papers/submissions/W15/W15-5009.pdf, which describes some sensitivity issues with BLEU.

  • Thank you so much for your detailed explanation. I updated my NLTK from 3.0.5 to 3.1, and I found you have already updated the bleu_score.py file. The problem is now solved. Thanks again. – flyingmouse Jan 02 '16 at 16:37