Questions tagged [bleu]

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.

55 questions
57 votes · 3 answers

Text Summarization Evaluation - BLEU vs ROUGE

With outputs from two different summary systems (sys1 and sys2) and the same reference summaries, I evaluated both with BLEU and ROUGE. The problem is: all ROUGE scores of sys1 were higher than sys2's (ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4,…
Chelsea_cole · 1,055
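
Both metrics from this question can be reproduced in a few lines of Python. A minimal sketch, assuming NLTK and Google's rouge-score package are installed; the sentence pair is invented for illustration:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from rouge_score import rouge_scorer

    reference = "the cat sat on the mat"          # hypothetical reference summary
    candidate = "the cat is sitting on the mat"   # hypothetical system output

    # BLEU is precision-oriented: n-gram overlap relative to the candidate.
    bleu = sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=SmoothingFunction().method1)

    # ROUGE is recall-oriented: overlap relative to the reference.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    rouge = scorer.score(reference, candidate)

    print(f"BLEU:    {bleu:.3f}")
    print(f"ROUGE-1: {rouge['rouge1'].fmeasure:.3f}")

Because the two metrics weight precision and recall differently, it is entirely possible for sys1 to win on every ROUGE variant while sys2 wins on BLEU.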
21 votes · 2 answers

NLTK: corpus-level BLEU vs sentence-level BLEU score

I have imported nltk in Python to calculate BLEU score on Ubuntu. I understand how sentence-level BLEU score works, but I don't understand how corpus-level BLEU score works. Below is my code for corpus-level BLEU score: import nltk hypothesis =…
Long Le Minh · 335
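
The difference this question asks about: corpus_bleu pools clipped n-gram counts over all segments before taking the geometric mean, which is not the same as averaging per-sentence scores. A sketch with invented token lists (bigram weights keep the toy example away from zero):

    from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

    hyp1, ref1 = ["the", "cat", "sat"], [["the", "cat", "sat"]]
    hyp2, ref2 = ["a", "dog", "ran", "fast"], [["the", "dog", "ran", "fast"]]
    weights = (0.5, 0.5)  # unigram + bigram only

    # Corpus-level: n-gram counts are summed across sentences first.
    print(corpus_bleu([ref1, ref2], [hyp1, hyp2], weights=weights))

    # Averaging sentence-level scores generally gives a different number.
    print((sentence_bleu(ref1, hyp1, weights=weights) +
           sentence_bleu(ref2, hyp2, weights=weights)) / 2)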
11 votes · 2 answers

Variation in BLEU Score

I have a question about BLEU score calculation for machine translation. I realized there may be different metrics for BLEU. I found that the code reports five values for BLEU, namely BLEU-1, BLEU-2, BLEU-3, BLEU-4 and finally BLEU, which seems to be an…
Jürgen K. · 3,427
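
The BLEU-1 through BLEU-4 values such tools report are cumulative scores with different n-gram weights, and the final BLEU is normally just BLEU-4. Something comparable can be computed with NLTK; a sketch on an invented sentence pair:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    ref = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]
    hyp = ["the", "fast", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]
    smooth = SmoothingFunction().method1

    # Cumulative BLEU-n: uniform weights over the first n n-gram orders.
    for n in range(1, 5):
        weights = tuple(1.0 / n for _ in range(n))
        score = sentence_bleu(ref, hyp, weights=weights, smoothing_function=smooth)
        print(f"BLEU-{n}: {score:.3f}")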
8 votes · 1 answer

What is the difference between mteval-v13a.pl and NLTK BLEU?

There is an implementation of the BLEU score in Python's NLTK, nltk.translate.bleu_score.corpus_bleu, but I am not sure whether it is the same as the mteval-v13a.pl script. What is the difference between them?
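
One concrete difference is tokenization: mteval-v13a.pl tokenizes raw text itself, while NLTK scores whatever tokens it is given. SacreBLEU reimplements the mteval-v13a behaviour; a sketch assuming sacrebleu 2.x, with invented sentences:

    from sacrebleu.metrics import BLEU

    hypotheses = ["The cat is on the mat."]      # detokenized system output
    references = [["The cat sat on the mat."]]   # one reference stream

    # tokenize="13a" mirrors mteval-v13a.pl's tokenizer (it is also the default).
    bleu = BLEU(tokenize="13a")
    print(bleu.corpus_score(hypotheses, references))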
4 votes · 1 answer

What's the difference between NLTK's BLEU score and SacreBLEU?

I'm curious if anyone is familiar with the difference between using NLTK's BLEU score calculation and the SacreBLEU library. In particular, I'm using each library's sentence BLEU scores, averaged over the entire dataset. The two give different…
Sean · 2,890
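
Much of the discrepancy asked about here comes from defaults: NLTK scores pre-tokenized lists on a 0-1 scale with no smoothing unless you pass one, while SacreBLEU tokenizes raw strings internally, smooths, and reports 0-100. A sketch with an invented sentence pair:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    import sacrebleu

    hyp = "the cat sat on the mat"
    ref = "the cat sat on a mat"

    # NLTK: token lists in, value in [0, 1].
    nltk_score = sentence_bleu([ref.split()], hyp.split(),
                               smoothing_function=SmoothingFunction().method1)

    # SacreBLEU: raw strings in, value in [0, 100], own tokenizer and smoothing.
    sacre_score = sacrebleu.sentence_bleu(hyp, [ref]).score

    print(nltk_score, sacre_score / 100)  # still not expected to match exactly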
4 votes · 1 answer

BLEU scores: can I use nltk.translate.bleu_score.sentence_bleu for calculating BLEU scores in Chinese?

If I have Chinese word lists, like reference = ['我', '是', '好', '人'] and hypothesis = ['我', '是', '善良的', '人'], could I use nltk.translate.bleu_score.sentence_bleu(references, hypothesis) for Chinese translation? Is it the same as for English? How about…
tktktk0711 · 1,656
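
sentence_bleu is language-agnostic: it only compares the token lists it is given, so the Chinese example from the question works as long as both sides use the same word segmentation. A sketch (smoothing is added because short sentences rarely share 4-grams):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = [['我', '是', '好', '人']]
    hypothesis = ['我', '是', '善良的', '人']

    # BLEU counts n-gram overlap between token lists, so it treats Chinese
    # exactly like English once the text is segmented into words.
    print(sentence_bleu(reference, hypothesis,
                        smoothing_function=SmoothingFunction().method1))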
4 votes · 1 answer

Why does nltk.align.bleu_score.bleu give an error?

I get a zero value when I calculate the BLEU score for Chinese sentences. The candidate sentence is c and the two references are r1 and r2: c=[u'\u9274\u4e8e', u'\u7f8e\u56fd', u'\u96c6', u'\u7ecf\u6d4e', u'\u4e0e', u'\u8d38\u6613', u'\u6700\u5927',…
flyingmouse · 1,014
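
The zero score in questions like this usually comes from BLEU's geometric mean: if the candidate shares no 3-gram or 4-gram with any reference, those precisions are zero and the product collapses. A sketch of the symptom and the usual remedy, on invented English tokens:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    ref = [["the", "cat", "sat", "on", "the", "mat"]]
    hyp = ["the", "cat", "on", "mat"]  # no 3-gram or 4-gram in common

    # Unsmoothed: zero higher-order counts push the score to (effectively) 0;
    # NLTK emits a warning here.
    print(sentence_bleu(ref, hyp))

    # Smoothed: zero counts are adjusted, giving a small non-zero score.
    print(sentence_bleu(ref, hyp, smoothing_function=SmoothingFunction().method1))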
4 votes · 0 answers

How to use basic BLEU score in Asiya Machine Translation Evaluation toolkit?

Asiya is a machine translation evaluation toolkit for scoring machine translation outputs (http://asiya.lsi.upc.edu/). It is largely written in Perl. How do I use Asiya to compute the BLEU metric? I have followed the YouTube introduction video:…
alvas · 115,346
3 votes · 0 answers

cannot compute __inference_pruned_8945 as input #0(zero-based) was expected to be a int64 tensor but is a int32 tensor [Op:__inference_pruned_8945]

I am trying to use the BLEURT metric for my task. I am new to BLEURT; when I try to execute scorer.score I get an error. from bleurt import score checkpoint = r".\bleurt\test_checkpoint" references = ["This is a test."] candidates = ["This is the…
M.K · 31
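
For comparison, the call pattern documented in the BLEURT repository looks like the sketch below (the checkpoint path is a placeholder). The int32/int64 error above is most likely environment-specific, e.g. a mismatch between the checkpoint and the installed TensorFlow version, rather than a problem in the calling code; that is an assumption, not a confirmed diagnosis:

    from bleurt import score

    checkpoint = "./bleurt/test_checkpoint"   # placeholder path
    references = ["This is a test."]
    candidates = ["This is the test."]

    # Recent bleurt versions require keyword arguments for score().
    scorer = score.BleurtScorer(checkpoint)
    print(scorer.score(references=references, candidates=candidates))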
3 votes · 0 answers

BLEU score value higher than 1

I've been looking at how the BLEU score works. What I understood from online videos and the original research paper is that the BLEU score should be within the range 0-1. Then, when I started to look at some research papers, I found that the BLEU value…
Minions · 5,104
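
The apparent contradiction is almost always scaling: the BLEU formula is bounded by [0, 1], but most tools and papers (mteval, multi-bleu, SacreBLEU) report it multiplied by 100. A sketch showing both conventions on the same trivial input:

    from nltk.translate.bleu_score import corpus_bleu
    import sacrebleu

    # NLTK reports the 0-1 scale from the original paper...
    print(corpus_bleu([[["the", "cat", "sat", "on", "the", "mat"]]],
                      [["the", "cat", "sat", "on", "the", "mat"]]))      # 1.0

    # ...while SacreBLEU, like mteval/multi-bleu, reports 0-100.
    print(sacrebleu.corpus_bleu(["the cat sat on the mat"],
                                [["the cat sat on the mat"]]).score)     # 100.0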
3 votes · 0 answers

Understanding ROUGE vs BLEU

I am looking into metrics for measuring the quality of text summarization. For this, I found this SO answer, which states: BLEU measures precision: how much the words (and/or n-grams) in the machine-generated summaries appeared in the human…
MichaelJanz · 1,775
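
The precision/recall contrast quoted in that answer can be made concrete with plain unigram counts; a toy Python sketch (real BLEU/ROUGE add higher-order n-grams, a brevity penalty, and optional stemming on top):

    from collections import Counter

    reference = "the cat sat on the mat".split()
    candidate = "the cat the cat sat".split()

    # Clipped overlap: each word counts at most as often as it appears per side.
    overlap = sum((Counter(reference) & Counter(candidate)).values())

    precision = overlap / len(candidate)  # BLEU-style: how much output is correct
    recall = overlap / len(reference)     # ROUGE-style: how much reference is covered

    print(precision, recall)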
3 votes · 1 answer

Why such a bad performance for Moses using Europarl?

I have started playing around with Moses and tried to make what I believe would be a fairly standard baseline system. I have basically followed the steps described on the website, but instead of using news-commentary I have used Europarl v7 for…
scozy · 2,511
2 votes · 2 answers

Why am I getting a low BLEU score?

from nltk.translate.bleu_score import sentence_bleu reference = [['this', 'is', 'ae', 'test']] candidate = ['this', 'is', 'ad', 'test'] score = sentence_bleu(reference, candidate) print(score) I am using this code to calculate the BLEU score and…
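
For exactly this example, the candidate and reference share no 3-gram or 4-gram, so the default BLEU-4 collapses to (effectively) zero with a warning. Lower-order weights or smoothing give a more informative number; a sketch:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = [['this', 'is', 'ae', 'test']]
    candidate = ['this', 'is', 'ad', 'test']

    # Default BLEU-4: no matching 3-/4-grams, so the score is effectively zero.
    print(sentence_bleu(reference, candidate))

    # Bigram BLEU: only unigram and bigram precision are used.
    print(sentence_bleu(reference, candidate, weights=(0.5, 0.5)))

    # Or keep BLEU-4 but smooth the zero counts.
    print(sentence_bleu(reference, candidate,
                        smoothing_function=SmoothingFunction().method1))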
2 votes · 0 answers

Best Smoothing Function to use in nltk corpus_bleu method

I'm trying to implement an image captioning model (CNN + LSTM), and as a validation metric I'm using the BLEU score. To be more precise, NLTK's corpus_bleu implementation. I tried using different SmoothingFunctions and I'm getting different…
Qwerty99 · 29
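
There is no universally best choice, but comparing the smoothing methods on the same validation captions is cheap. A sketch with invented caption data (method1 adds a small epsilon to zero counts; method4 is a length-aware variant often used for short hypotheses like captions):

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # One list of references per image, one tokenized hypothesis per image.
    references = [[["a", "dog", "plays", "in", "the", "grass"]],
                  [["two", "people", "ride", "bikes"]]]
    hypotheses = [["a", "dog", "runs", "in", "grass"],
                  ["two", "people", "riding", "bikes"]]

    sf = SmoothingFunction()
    for name, method in [("method0", sf.method0), ("method1", sf.method1),
                         ("method4", sf.method4)]:
        print(name, round(corpus_bleu(references, hypotheses,
                                      smoothing_function=method), 4))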
2 votes · 2 answers

Calculating BLEU and ROUGE scores as fast as possible

I have around 200 candidate sentences, and for each candidate I want to measure the BLEU score by comparing each sentence with thousands of reference sentences. These references are the same for all candidates. Here is how I'm doing it right…
mitra mirshafiee · 393
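
With a shared reference set, the main savings come from not re-scoring pairs in a Python loop: one corpus_bleu call scores every candidate at once, and a process pool parallelizes per-candidate scores if those are needed. A sketch with invented token lists:

    from multiprocessing import Pool
    from nltk.translate.bleu_score import (corpus_bleu, sentence_bleu,
                                           SmoothingFunction)

    # Hypothetical data: every candidate shares the same references.
    references = [["the", "cat", "sat", "on", "the", "mat"],
                  ["a", "cat", "is", "on", "the", "mat"]]
    candidates = [["the", "cat", "is", "on", "the", "mat"],
                  ["a", "dog", "sat", "on", "a", "mat"]]
    smooth = SmoothingFunction().method1

    def score_one(candidate):
        # Per-candidate BLEU against the shared reference set.
        return sentence_bleu(references, candidate, smoothing_function=smooth)

    if __name__ == "__main__":
        # Single corpus-level score over all candidates.
        print(corpus_bleu([references] * len(candidates), candidates,
                          smoothing_function=smooth))
        # Parallel per-candidate scores.
        with Pool() as pool:
            print(pool.map(score_one, candidates))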