I am learning to write code that distinguishes spam from non-spam emails. I have finished the training part. While testing, I had to compare the prediction array against the test-target array and ran into an error, so I wrote two different versions of the comparison. These two versions yield different outputs. Could anyone help me figure out which version is better and more accurate, and whether there is a simpler way?
The error states:
DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
correct_docs = (y_test==prediction)
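From what I can tell, this warning shows up when the two arrays being compared have different lengths. Here is a minimal sketch with made-up toy arrays (not my actual data) of the pattern I mean, trimming both arrays to a common length before comparing:

```python
import numpy as np

# Toy stand-ins for y_test and prediction; the lengths deliberately
# differ, mimicking the situation that triggers the warning.
y_toy = np.array([1, 0, 1, 1])
pred_toy = np.array([True, False, False])

# Trimming both arrays to the common length makes the elementwise
# comparison well-defined.
n = min(len(y_toy), len(pred_toy))
correct = (y_toy[:n] == pred_toy[:n]).sum()
print(correct)  # 2 of the first 3 labels match
```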
I tried the following code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#file paths
TOKEN_SPAM_PROB_FILE="SpamData/03_Testing/prob-spam.txt"
TOKEN_NONSPAM_PROB_FILE="SpamData/03_Testing/prob-nonspam.txt"
TOKEN_ALL_PROB_FILE="SpamData/03_Testing/prob-all-tokens.txt"
TEST_FEATURE_MATRIX="SpamData/03_Testing/test-features.txt"
TEST_TARGET_FILE="SpamData/03_Testing/test-target.txt"
VOCAB_SIZE=2500
#features
x_test=np.loadtxt(TEST_FEATURE_MATRIX, delimiter=" ")
#target
y_test=np.loadtxt(TEST_TARGET_FILE, delimiter=" ")
#token probabilities
prob_token_spam=np.loadtxt(TOKEN_SPAM_PROB_FILE, delimiter=" ")
prob_token_nonspam=np.loadtxt(TOKEN_NONSPAM_PROB_FILE, delimiter=" ")
prob_all_token=np.loadtxt(TOKEN_ALL_PROB_FILE, delimiter=" ")
PROB_SPAM=0.3116
joint_log_spam=x_test.dot(np.log(prob_token_spam) - np.log(prob_all_token)) + np.log(PROB_SPAM)
joint_log_nonspam=x_test.dot(np.log(prob_token_nonspam) - np.log(prob_all_token)) + np.log(1-PROB_SPAM)
prediction=joint_log_spam > joint_log_nonspam
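As far as I understand, the "simplification" below works because the `x_test.dot(np.log(prob_all_token))` term is the same on both sides of the comparison, so dropping it cannot flip which score is larger. A quick sketch with made-up toy numbers:

```python
import numpy as np

# Toy log-scores (made up); shared_term stands in for the common
# x_test.dot(np.log(prob_all_token)) denominator.
log_spam = np.array([-3.2, -1.1, -7.5])
log_nonspam = np.array([-2.9, -4.0, -6.0])
shared_term = np.array([-0.7, -2.3, -1.5])

# Subtracting the same term from both sides leaves the comparison unchanged.
with_denominator = (log_spam - shared_term) > (log_nonspam - shared_term)
without_denominator = log_spam > log_nonspam
print(np.array_equal(with_denominator, without_denominator))  # True
```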
#simplification
joint_log_spam=x_test.dot(np.log(prob_token_spam)) + np.log(PROB_SPAM)
joint_log_nonspam=x_test.dot(np.log(prob_token_nonspam)) + np.log(1-PROB_SPAM)
#number of correct documents
correct_docs = (y_test==prediction)
# I want to use the following sum command as well
correct_docs = (y_test==prediction).sum()
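Assuming `y_test` and `prediction` end up with the same length, this `.sum()` idiom is the pattern I am aiming for. A self-contained toy version (toy arrays are my own stand-ins, not the real data):

```python
import numpy as np

# Toy labels and predictions of equal length; boolean predictions
# compare fine against 0/1 labels because True == 1 and False == 0.
y_toy = np.array([1, 0, 1, 1, 0])
pred_toy = np.array([True, False, False, True, False])

correct = (y_toy == pred_toy).sum()   # scalar count of correct documents
wrong = y_toy.shape[0] - correct      # count of incorrect documents
accuracy = correct / y_toy.shape[0]
print(correct, wrong, accuracy)
```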
Then I tried the following two approaches, but they gave different outputs.
#Code 1
#number of correct documents
correct_docs=y_test[:len(prediction)]==prediction[:len(prediction)]
print("Length of correct_docs is:", len(correct_docs))
print("Docs Classified correctly are:", correct_docs)
numbdocs_wrong=x_test.shape[0]-correct_docs
print("Docs classified incorrectly are:", numbdocs_wrong)
#Code 2
#number of correct documents
correct_doc=[np.where(y_test==x)[0][0] for x in prediction]
# print(correct_doc)
total=0
for i in correct_doc:
    if i != 0:
        total += 1
# np.digitize(y_test, prediction)
print(total)
correct_doc_total=total
correct_docs=correct_doc_total
print("Docs Classified correctly are:", correct_docs)
numbdocs_wrong=x_test.shape[0]-correct_docs
print("Docs classified incorrectly are:", numbdocs_wrong)
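While debugging Code 2 on a toy example (toy arrays made up by me), I noticed that the `np.where` comprehension returns, for each predicted value `x`, the index of the FIRST element of `y_test` equal to `x`, regardless of position, so I am not sure it counts correct documents at all:

```python
import numpy as np

# Toy stand-ins for y_test and prediction.
y_toy = np.array([1.0, 0.0, 1.0])
pred_toy = np.array([True, False, False])

# For each x, this finds the first index in y_toy whose value equals x,
# not whether the prediction at that position matches the label.
indices = [np.where(y_toy == x)[0][0] for x in pred_toy]
print(indices)  # first 1 is at index 0, first 0 is at index 1
```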
The link to the folder containing all the files is: https://drive.google.com/drive/folders/15M7-VcUZw7gkLWxlJ8MDKLm6muYIREoT?usp=share_link