I am very new to machine learning and have a problem to solve with supervised learning.

Problem: I have training data in a .csv file where column 1 is the data (user emails) and column 2 is the corresponding label (the category I want to classify each email into). The classifier should learn from this training data and, when later given new data, assign it one of the labels seen during training. I also want a per-classification confidence score (as a percentage) so I can judge how reliable each classification is.
Here is the code I am trying:
    import random

    import pandas as pd
    import nltk

    def clean_data(data):
        # Strip newlines, quotes and backslashes from the raw email text
        # ('\r\n' must be replaced before '\n' and '\r', or it is never matched)
        return (str(data).replace('\r\n', '').replace('\n', '').replace('\r', '')
                .replace('\'', '').replace('\\', ''))

    def data_features(word):
        # Use the whole cleaned text as a single feature value
        return {'test_data': word}

    def clean_data_feature(word):
        return data_features(clean_data(word))

    def classifydata(filename, datacolumn, labelcolumn):
        df = pd.read_csv(filename, encoding='latin1', index_col=None,
                         dtype={datacolumn: str})
        subset = df[[datacolumn, labelcolumn]]
        labeled_names = [tuple(x) for x in subset.values]
        random.shuffle(labeled_names)
        featuresets = [(clean_data_feature(n), label) for (n, label) in labeled_names]
        # NOTE: train and test are currently the same data
        train_set, test_set = featuresets, featuresets
        classifier = nltk.NaiveBayesClassifier.train(train_set)
        df = pd.read_csv('D:/ML/Event_Data_601-700.csv', encoding='latin1',
                         index_col=None, dtype={'mMsgContent': str})
        for data in df['mMsgContent']:
            print(classifier.classify(clean_data_feature(data)))

    classifydata('D:/ML/Event_Data_Training_600.csv', 'mMsgContent', 'call related to')
This prints the classification learned from the training data, but I also want to know how confident the classifier is, as a percentage, that each per-record classification is accurate.

Any help, suggestions, or improvements to the way this code is written are appreciated; let me know if I should add more details.
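For per-record confidence, `nltk.NaiveBayesClassifier` provides `prob_classify`, which returns a probability distribution over the labels instead of just the winning label. A minimal sketch on toy data (the feature name `test_data` matches the code above; the example emails and labels are made up for illustration):

```python
import nltk

# Toy training set in the same shape as the featuresets built above:
# a feature dict mapping 'test_data' to the cleaned text, paired with a label.
train_set = [
    ({'test_data': 'please reset my password'}, 'password'),
    ({'test_data': 'reset password link expired'}, 'password'),
    ({'test_data': 'invoice for last month'}, 'billing'),
    ({'test_data': 'billing invoice is wrong'}, 'billing'),
]
classifier = nltk.NaiveBayesClassifier.train(train_set)

feats = {'test_data': 'please reset my password'}
dist = classifier.prob_classify(feats)  # probability distribution over labels
label = dist.max()                      # same label classify() would return
confidence = dist.prob(label)           # probability of that label, 0..1
print(label, round(confidence * 100, 2), '%')
```

In the loop over `df['mMsgContent']`, calling `prob_classify(clean_data_feature(data))` in place of `classify` gives both the label and its probability. Note that because each whole email is a single feature value, unseen emails carry little information for Naive Bayes; splitting the text into word features usually gives more meaningful probabilities.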