Tensorflow Precision / Recall / F1 score and Confusion matrix

Question

I would like to know if there is a way to implement the different score function from the scikit learn package like this one :

from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_pred)

into a tensorflow model to get the different score.

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
init = tf.initialize_all_variables()
sess.run(init)
for epoch in xrange(1):
        avg_cost = 0.
        total_batch = len(train_arrays) / batch_size
        for batch in range(total_batch):
                train_step.run(feed_dict = {x: train_arrays, y: train_labels})
                avg_cost += sess.run(cost, feed_dict={x: train_arrays, y: train_labels})/total_batch
        if epoch % display_step == 0:
                print "Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(avg_cost)

print "Optimization Finished!"
correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
# Calculate accuracy
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print "Accuracy:", batch, accuracy.eval({x: test_arrays, y: test_labels})

Will i have to run the session again to get the prediction ?

instead of "accuracy.eval", you can do "session.run([accuracy, prediction], feed_dict=...), that will get both tensors at the same time. See http://stackoverflow.com/questions/33610685/in-tensorflow-what-is-the-difference-between-session-run-and-tensor-eval — Yaroslav Bulatov, Feb 12 '16 at 18:12
I understand your comment but how do i implement this with sklearn ? Because in the confusion matrix case, i don't want the accuracy ! — nicolasdavid, Feb 16 '16 at 15:18
But how can we draw a confusion matrix from tensorflow (correct_prediction and y_Test(truth labels)) as i have alrady asked it here,..http://stackoverflow.com/questions/35792969/how-to-calculate-precision-and-recall-from-an-incomplete-confusion-matrix.. Please help — Nomiluks, Mar 04 '16 at 10:06
This Question also similar to this one with more detailed solution: http://stackoverflow.com/questions/35756710/how-do-i-create-confusion-matrix-of-predicted-and-ground-truth-labels-with-tenso/35876136#35876136 — Nomiluks, Mar 08 '16 at 19:20

score 61 · Accepted Answer · edited Jan 29 '19 at 18:38

61

You do not really need sklearn to calculate precision/recall/f1 score. You can easily express them in TF-ish way by looking at the formulas:

Now if you have your actual and predicted values as vectors of 0/1, you can calculate TP, TN, FP, FN using tf.count_nonzero:

TP = tf.count_nonzero(predicted * actual)
TN = tf.count_nonzero((predicted - 1) * (actual - 1))
FP = tf.count_nonzero(predicted * (actual - 1))
FN = tf.count_nonzero((predicted - 1) * actual)

Now your metrics are easy to calculate:

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

edited Jan 29 '19 at 18:38

keramat

4,328
6
25
38

answered May 14 '17 at 05:39

Salvador Dali

214,103
147
703
753

1

When computing precision by `precision = TP / (TP + FP)`, I find that precision always results in 0, as it seems it does integer division. Using `precision = tf.divide(TP, TP + FP)` worked for me, though. Similar for recall. – Harish Rajagopal Oct 20 '17 at 14:24
1

@Salvador, when you say `values as vectors of 0/1`, are you saying that the values are onehot encodings? e.g. `predicted = [0, 1] actual = [1, 0]` is a false positive for binary case – Brian Oct 30 '18 at 21:43
3

In TF v2.x, the corresponding functions are `tf.math.count_nonzero` and `tf.math.divide`. – user650654 Mar 19 '20 at 18:35

score 32 · Answer 2 · answered Mar 04 '16 at 11:42

Maybe this example will speak to you :

    pred = multilayer_perceptron(x, weights, biases)
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

    with tf.Session() as sess:
    init = tf.initialize_all_variables()
    sess.run(init)
    for epoch in xrange(150):
            for i in xrange(total_batch):
                    train_step.run(feed_dict = {x: train_arrays, y: train_labels})
                    avg_cost += sess.run(cost, feed_dict={x: train_arrays, y: train_labels})/total_batch         
            if epoch % display_step == 0:
                    print "Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(avg_cost)

    #metrics
    y_p = tf.argmax(pred, 1)
    val_accuracy, y_pred = sess.run([accuracy, y_p], feed_dict={x:test_arrays, y:test_label})

    print "validation accuracy:", val_accuracy
    y_true = np.argmax(test_label,1)
    print "Precision", sk.metrics.precision_score(y_true, y_pred)
    print "Recall", sk.metrics.recall_score(y_true, y_pred)
    print "f1_score", sk.metrics.f1_score(y_true, y_pred)
    print "confusion_matrix"
    print sk.metrics.confusion_matrix(y_true, y_pred)
    fpr, tpr, tresholds = sk.metrics.roc_curve(y_true, y_pred)

can you update and explain what `test_arrays` and `train_arrays` are? Because it looks like you are either accumulating the results for all of the batches in a given epoch, or you are just calculating the confusion for the results of a single batch, in which case you'd still have to accumulate the results of all the batches for a confusion w.r.t. the whole test epoch in an array outside of tensorflow. — Chris, Apr 30 '16 at 17:44
@nicolasdavid I tried your solution, but I get this error `ValueError: Target is multiclass but average='binary'. Please choose another average setting`. my `y_pred` and `y_true` are both 1d array-like as spacification of the method require. Any suggestion ? — Kyrol, Apr 12 '17 at 08:52
I think it's better to use metrics APIs provided in `tf.contrib.metrics` instead of mixing metrics functions provided by scikit-learn with tensorflow. — Nandeesh, Jun 30 '17 at 15:09

score 17 · Answer 3 · edited Jun 20 '20 at 09:12

Multi-label case

Previous answers do not specify how to handle the multi-label case so here is such a version implementing three types of multi-label f1 score in tensorflow: micro, macro and weighted (as per scikit-learn)

Update (06/06/18): I wrote a blog post about how to compute the streaming multilabel f1 score in case it helps anyone (it's a longer process, don't want to overload this answer)

f1s = [0, 0, 0]

y_true = tf.cast(y_true, tf.float64)
y_pred = tf.cast(y_pred, tf.float64)

for i, axis in enumerate([None, 0]):
    TP = tf.count_nonzero(y_pred * y_true, axis=axis)
    FP = tf.count_nonzero(y_pred * (y_true - 1), axis=axis)
    FN = tf.count_nonzero((y_pred - 1) * y_true, axis=axis)

    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)

    f1s[i] = tf.reduce_mean(f1)

weights = tf.reduce_sum(y_true, axis=0)
weights /= tf.reduce_sum(weights)

f1s[2] = tf.reduce_sum(f1 * weights)

micro, macro, weighted = f1s

Correctness

def tf_f1_score(y_true, y_pred):
    """Computes 3 different f1 scores, micro macro
    weighted.
    micro: f1 score accross the classes, as 1
    macro: mean of f1 scores per class
    weighted: weighted average of f1 scores per class,
            weighted from the support of each class


    Args:
        y_true (Tensor): labels, with shape (batch, num_classes)
        y_pred (Tensor): model's predictions, same shape as y_true

    Returns:
        tuple(Tensor): (micro, macro, weighted)
                    tuple of the computed f1 scores
    """

    f1s = [0, 0, 0]

    y_true = tf.cast(y_true, tf.float64)
    y_pred = tf.cast(y_pred, tf.float64)

    for i, axis in enumerate([None, 0]):
        TP = tf.count_nonzero(y_pred * y_true, axis=axis)
        FP = tf.count_nonzero(y_pred * (y_true - 1), axis=axis)
        FN = tf.count_nonzero((y_pred - 1) * y_true, axis=axis)

        precision = TP / (TP + FP)
        recall = TP / (TP + FN)
        f1 = 2 * precision * recall / (precision + recall)

        f1s[i] = tf.reduce_mean(f1)

    weights = tf.reduce_sum(y_true, axis=0)
    weights /= tf.reduce_sum(weights)

    f1s[2] = tf.reduce_sum(f1 * weights)

    micro, macro, weighted = f1s
    return micro, macro, weighted


def compare(nb, dims):
    labels = (np.random.randn(nb, dims) > 0.5).astype(int)
    predictions = (np.random.randn(nb, dims) > 0.5).astype(int)

    stime = time()
    mic = f1_score(labels, predictions, average='micro')
    mac = f1_score(labels, predictions, average='macro')
    wei = f1_score(labels, predictions, average='weighted')

    print('sklearn in {:.4f}:\n    micro: {:.8f}\n    macro: {:.8f}\n    weighted: {:.8f}'.format(
        time() - stime, mic, mac, wei
    ))

    gtime = time()
    tf.reset_default_graph()
    y_true = tf.Variable(labels)
    y_pred = tf.Variable(predictions)
    micro, macro, weighted = tf_f1_score(y_true, y_pred)
    with tf.Session() as sess:
        tf.global_variables_initializer().run(session=sess)
        stime = time()
        mic, mac, wei = sess.run([micro, macro, weighted])
        print('tensorflow in {:.4f} ({:.4f} with graph time):\n    micro: {:.8f}\n    macro: {:.8f}\n    weighted: {:.8f}'.format(
            time() - stime, time()-gtime,  mic, mac, wei
        ))

compare(10 ** 6, 10)

outputs:

>> rows: 10^6 dimensions: 10
sklearn in 2.3939:
    micro: 0.30890287
    macro: 0.30890275
    weighted: 0.30890279
tensorflow in 0.2465 (3.3246 with graph time):
    micro: 0.30890287
    macro: 0.30890275
    weighted: 0.30890279

Posted the issue here: https://stackoverflow.com/questions/56425049/data-type-mismatch-in-streaming-f1-score-calculation-in-tensorflow — sherlock, Jun 03 '19 at 10:04
`tf.count_nonzero` was moved to `tf.math.count_nonzero` on TF v1.12 release. — tuomastik, Oct 11 '21 at 05:15

score 4 · Answer 4 · edited Dec 07 '17 at 19:26

4

Use the metrics APIs provided in tf.contrib.metrics, for example:

labels = ...
predictions = ...

accuracy, update_op_acc = tf.contrib.metrics.streaming_accuracy(labels, predictions)
error, update_op_error = tf.contrib.metrics.streaming_mean_absolute_error(labels, predictions)

sess.run(tf.local_variables_initializer())
for batch in range(num_batches):
  sess.run([update_op_acc, update_op_error])
accuracy, mean_absolute_error = sess.run([accuracy, mean_absolute_error])

edited Dec 07 '17 at 19:26

Jay Patel

37
7

answered Jun 30 '17 at 15:06

Nandeesh

2,683
2
30
42

1

Note that these are cumulative results which might be confusing. – John Wu Dec 04 '17 at 04:46

score 4 · Answer 5 · edited Jul 18 '17 at 11:35

Since i have not enough reputation to add a comment to Salvador Dalis answer this is the way to go:

tf.count_nonzero casts your values into an tf.int64 unless specified otherwise. Using:

argmax_prediction = tf.argmax(prediction, 1)
argmax_y = tf.argmax(y, 1)

TP = tf.count_nonzero(argmax_prediction * argmax_y, dtype=tf.float32)
TN = tf.count_nonzero((argmax_prediction - 1) * (argmax_y - 1), dtype=tf.float32)
FP = tf.count_nonzero(argmax_prediction * (argmax_y - 1), dtype=tf.float32)
FN = tf.count_nonzero((argmax_prediction - 1) * argmax_y, dtype=tf.float32)

is a realy good idea.

argmax returns indices, so it seems that these wont work? – Rikku Porta Sep 07 '17 at 13:55 — Rikku Porta, Sep 07 '17 at 13:55

Tensorflow Precision / Recall / F1 score and Confusion matrix

5 Answers5

Multi-label case

Correctness

Linked