Creating confusion matrix from multiple .csv files

Question

I have a lot of .csv files with the following format.

From column 1, I want to read the current row and compare it with the value of the previous row. If it is greater than or equal to the previous value, I continue comparing; if the current value is smaller than the previous one, I divide the current value by the previous value and proceed. For example, in the table given above, the first smaller value in column 1 is 327 (because 327 is smaller than the previous value, 340), so we divide 327 by 340 and get 0.96. My Python script should exit right after printing the criterion (A) as given below.

from __future__ import division
import csv

def category(val):
    if 0.8 < val <= 0.9:
        return "A"
    if abs(val - 0.7) < 1e-10:
        return "B"
    if 0.5 < val < 0.7:
        return "C"
    if abs(val - 0.5) < 1e-10:
        return "E"
    return "D"

with open("test.csv", "r") as csvfile:
    ff = csv.reader(csvfile)

    results = []
    previous_value = 0
    for col1, col2 in ff:
        if not col1.isdigit():
            continue
        value = int(col1)
        if value >= previous_value:
            previous_value = value
            continue
        else:
            result = value / previous_value
            results.append(result)
            print(category(result))
            previous_value = value
    print (results)
    print (sum(results))
    print (category(sum(results) / len(results)))
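As a quick check of the comparison rule described above, here is a minimal sketch that works on an in-memory list instead of a CSV file. It assumes Python 3 (true division), and the function name `drop_ratios` as well as the sample values other than 340 and 327 are made up for illustration.

```python
def drop_ratios(values):
    """Return current/previous for every value smaller than its predecessor."""
    ratios = []
    previous = None
    for value in values:
        if previous is not None and value < previous:
            ratios.append(value / previous)
        previous = value
    return ratios


# 300 and 350 are hypothetical; 340 and 327 come from the example in the question.
sample = [300, 340, 327, 350]
ratios = drop_ratios(sample)
print([round(r, 2) for r in ratios])
```

Only the step from 340 down to 327 produces a ratio here; rising or equal values just move the comparison forward.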

Finally, I want to run my script for all the .csv files in the current directory and build a confusion matrix like the following. Let's say A1.csv, A2.csv and A3.csv are supposed (or predicted) to print A; B1.csv, B2.csv and B3.csv are supposed (or predicted) to print B; C1.csv, C2.csv and C3.csv are supposed (or predicted) to print C; and so on. How can we automatically create a confusion matrix from multiple .csv files, for example like the following, using Python?

As shown below, the colored blocks of the matrix (row labels) show the counts of true values for A, B, C, etc., as produced by the control logic of our function category() given above. The column labels come from the control logic inside the if-else statements (A, B, C, D and E).

Are the dimensions of the matrix known beforehand? – stovfl May 28 '17 at 11:09

stovfl · Answer 1 · 2017-06-04T19:04:45.787

3

Add a function get_predict(filename):

def get_predict(filename):
    if 'Alex' in filename:
        return 'Alexander'
    else:
        return filename[0]
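To illustrate what this helper returns, here is a small sketch. The filenames are hypothetical, and the function body simply repeats the logic shown above. Presumably it is meant to replace `filename[0]` when filling `matrix['predict']` in the loop further down.

```python
def get_predict(filename):
    # Same logic as in the answer: a special case for 'Alex',
    # otherwise the first character of the filename is the label.
    if 'Alex' in filename:
        return 'Alexander'
    return filename[0]


# Hypothetical filenames, for illustration only.
names = ['A1.csv', 'B2.csv', 'Alex_run1.csv']
labels = [get_predict(name) for name in names]
print(labels)  # ['A', 'B', 'Alexander']
```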

Reading n files, compute confusion matrix using pandas crosstab:

import csv
import os
import pandas as pd

def get_category(filepath):
    def category(val):
        print('predict({}); abs({})'.format(val, abs(val)))
        if 0.8 < val <= 0.9:
            return "A"
        if abs(val - 0.7) < 1e-10:
            return "B"
        if 0.5 < val < 0.7:
            return "C"
        if abs(val - 0.5) < 1e-10:
            return "E"
        return "D"

    with open(filepath, "r") as csvfile:
        ff = csv.reader(csvfile)

        results = []
        previous_value = 0
        for col1, col2 in ff:
            value = int(col1)
            if value >= previous_value:
                previous_value = value
            else:
                results.append(value / previous_value)
                previous_value = value

    return category(sum(results) / len(results))

matrix = {'actual':[], 'predict':[]}
path = 'test/confusion'
for filename in os.listdir( path ):
    # The first Char in filename is Predict Key
    matrix['predict'].append(filename[0])
    matrix['actual'].append(get_category(os.path.join(path, filename)))

df = pd.crosstab(pd.Series(matrix['actual'], name='Actual'),
                 pd.Series(matrix['predict'], name='Predicted')
                 )
print(df)

Output: (Reading "A.csv, B.csv, C.csv" with the given example Data three times)
Predicted  A  B  C
Actual            
A          3  0  0
B          0  3  0
C          0  0  3

Tested with Python:3.4.2 - pandas:0.19.2
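Note that pd.crosstab only lists labels that actually occur, which is why D and E are missing from the output above. If the matrix should always have the same dimensions, one option is to count into a fixed label grid. The following is a minimal sketch in plain Python, not part of the tested code above; the name `confusion_counts` and the sample results are assumptions made for illustration.

```python
from collections import Counter

LABELS = ["A", "B", "C", "D", "E"]  # every label category() can return


def confusion_counts(actual, predicted, labels=LABELS):
    """Rows follow `actual`, columns follow `predicted`, both in `labels` order."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in labels] for a in labels]


# Hypothetical results for six files.
actual = ["A", "A", "B", "C", "D", "C"]
predicted = ["A", "A", "B", "C", "C", "C"]
grid = confusion_counts(actual, predicted)

print("   " + "  ".join(LABELS))
for label, row in zip(LABELS, grid):
    print(label + "  " + "  ".join(str(n) for n in row))
```

In pandas, something like `df.reindex(index=LABELS, columns=LABELS, fill_value=0)` applied to the crosstab result should give the same fixed shape, though that variant is untested here.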

edited Jun 04 '17 at 19:04

answered Jun 01 '17 at 21:59

stovfl


Very good. Can we use the same word in predicted (instead of a single character) as in actual? – Mahsolid Jun 04 '17 at 13:08
@Mahsolid: `def category(..` returns a char `A|B|C...`, therefore `predict` has to be the same. Give an example of what you mean by _**same word**_. – stovfl Jun 04 '17 at 13:28
For example, let's say `if 0.8 < val <= 0.9: return "Alexander"`. This should give `Alexander` in both the `Actual` and `Predicted` columns. This is what I meant, dear. – Mahsolid Jun 04 '17 at 18:01
@Mahsolid: As long as your `filename` starts with e.g. `Alexander`, this should also work. You have to edit the code to handle this. – stovfl Jun 04 '17 at 18:15
Excellent, and thanks. I will fix it now. – Mahsolid Jun 04 '17 at 19:33
What if we want to print the accuracy? I tried it with `print(accuracy_score('actual', 'predict'))` but it gives me this error: `ValueError: Found input variables with inconsistent numbers of samples: [6, 7]` – Mahsolid Jun 05 '17 at 09:23
How can we display the accuracy, dear? – Mahsolid Jun 06 '17 at 06:09
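The error in the comment above most likely comes from passing the literal strings `'actual'` and `'predict'`, which are treated as sequences of 6 and 7 characters, hence `[6, 7]`. The label lists themselves have to be passed instead. Here is a minimal sketch that computes the accuracy by hand; the helper name `accuracy` and the sample labels are assumptions for illustration.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the two label lists agree."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("no samples to score")
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


# Hypothetical contents of the `matrix` dict built in the answer.
matrix = {'actual': ['A', 'A', 'B', 'C'],
          'predict': ['A', 'B', 'B', 'C']}
score = accuracy(matrix['actual'], matrix['predict'])
print(score)  # 0.75
```

With scikit-learn installed, `accuracy_score(matrix['actual'], matrix['predict'])` from `sklearn.metrics` should return the same number.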

score 1 · Answer 2 · answered Jun 04 '17 at 12:07

Using scikit-learn is the best option in your case, as it provides a confusion_matrix function. Here is an approach you can easily extend.

from sklearn.metrics import confusion_matrix

# Read your csv files
with open('A1.csv', 'r') as readFile:
    true_values = [int(ff) for ff in readFile]
with open('B1.csv', 'r') as readFile:
    predictions = [int(ff) for ff in readFile]

# Produce the confusion matrix
confusionMatrix = confusion_matrix(true_values, predictions)

print(confusionMatrix)

This is the output you would expect.

[[0 2]
 [0 2]]

For more hints, check out the following link:

How to write a confusion matrix in Python?
