I am learning to write code that distinguishes spam from non-spam emails. I have finished the training part. In the testing part, I need to compare the prediction array with the test target array, but the comparison produced an error, so I wrote two alternative versions of the code. The two versions give different outputs. Could anyone tell me which version is correct, and whether there is a simpler way to do this?

The error states:

DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  correct_docs = (y_test==prediction)
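
This warning usually appears when NumPy cannot compare two arrays element by element, most often because their shapes differ. Whether that is the cause here is an assumption, since the shapes are not shown. A minimal sketch of the check, using toy arrays with hypothetical values rather than the real SpamData files:

```python
import numpy as np

# Toy stand-ins for the real arrays (hypothetical values, not loaded from SpamData).
y_test = np.array([1.0, 0.0, 1.0, 0.0])      # 4 labels
prediction = np.array([True, False, True])   # 3 predictions: one short

print("y_test shape:", y_test.shape)
print("prediction shape:", prediction.shape)

# Comparing the shape tuples is safe even when the arrays themselves cannot be compared.
shapes_match = y_test.shape == prediction.shape
print("Shapes match:", shapes_match)
```

If the shapes do not match, the test features and the test targets probably do not describe the same set of documents, and that is worth fixing before counting anything.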

I tried the following code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#file address

TOKEN_SPAM_PROB_FILE="SpamData/03_Testing/prob-spam.txt"
TOKEN_NONSPAM_PROB_FILE="SpamData/03_Testing/prob-nonspam.txt"
TOKEN_ALL_PROB_FILE="SpamData/03_Testing/prob-all-tokens.txt"

TEST_FEATURE_MATRIX="SpamData/03_Testing/test-features.txt"
TEST_TARGET_FILE="SpamData/03_Testing/test-target.txt"



VOCAB_SIZE=2500

#features
x_test=np.loadtxt(TEST_FEATURE_MATRIX, delimiter=" ")
#target
y_test=np.loadtxt(TEST_TARGET_FILE, delimiter=" ")
#token probabilities
prob_token_spam=np.loadtxt(TOKEN_SPAM_PROB_FILE, delimiter=" ")
prob_token_nonspam=np.loadtxt(TOKEN_NONSPAM_PROB_FILE, delimiter=" ")
prob_all_token=np.loadtxt(TOKEN_ALL_PROB_FILE, delimiter=" ")

PROB_SPAM=0.3116

joint_log_spam=x_test.dot(np.log(prob_token_spam) - np.log(prob_all_token)) + np.log(PROB_SPAM)

joint_log_nonspam=x_test.dot(np.log(prob_token_nonspam) - np.log(prob_all_token)) + np.log(1-PROB_SPAM)


prediction=joint_log_spam > joint_log_nonspam

#simplification

joint_log_spam=x_test.dot(np.log(prob_token_spam)) + np.log(PROB_SPAM)

joint_log_nonspam=x_test.dot(np.log(prob_token_nonspam)) + np.log(1-PROB_SPAM)

#number of correct documents

correct_docs = (y_test==prediction)

# I want to use the following sum command as well

correct_docs = (y_test==prediction).sum()

Then I tried the following two versions, but they give different outputs:

#Code 1

#number of correct documents

correct_docs=y_test[:len(prediction)]==prediction[:len(prediction)]

print("Length of correct_docs is:", len(correct_docs))

print("Docs Classified correctly are:", correct_docs)

numbdocs_wrong=x_test.shape[0]-correct_docs

print("Docs classified incorrectly are:", numbdocs_wrong)
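
Note that Code 1 stores a boolean array in correct_docs, so the subtraction on its last line is applied to every element instead of to a single count. A minimal sketch of turning the comparison into counts, assuming both arrays have the same length; the toy values and the names num_correct, num_wrong and accuracy are hypothetical:

```python
import numpy as np

# Toy arrays of equal length (hypothetical values).
y_test = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
prediction = np.array([True, False, False, False, True])

# Elementwise comparison gives one boolean per document.
matches = y_test == prediction

# Summing the booleans counts the True entries.
num_correct = int(matches.sum())
num_wrong = len(y_test) - num_correct
accuracy = num_correct / len(y_test)

print("Docs classified correctly:", num_correct)
print("Docs classified incorrectly:", num_wrong)
print("Accuracy:", accuracy)
```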

#Code 2

#number of correct documents

correct_doc=[np.where(y_test==x)[0][0] for x in prediction]
# print(correct_doc)

total=0
for i in correct_doc:
    if i!=0:
        total+=1
# np.digitize(y_test, prediction)
print(total)
correct_doc_total=total

correct_docs=correct_doc_total

print("Docs Classified correctly are:", correct_docs)
numbdocs_wrong=x_test.shape[0]-correct_docs
print("Docs classified incorrectly are:", numbdocs_wrong)
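
As far as I can tell, Code 2 does not compare each prediction with its own label. For every prediction it looks up the first position in y_test that holds the same value, and then counts how many of those positions are not zero. A small sketch on toy arrays (hypothetical values) that puts this lookup next to a direct elementwise comparison:

```python
import numpy as np

# Toy arrays of equal length (hypothetical values).
y_test = np.array([1.0, 1.0, 0.0, 0.0])
prediction = np.array([True, True, True, False])

# Direct comparison: document i is correct when prediction[i] equals y_test[i].
direct_correct = int((y_test == prediction).sum())

# Code 2 logic: index of the FIRST label equal to each predicted value.
first_match = [int(np.where(y_test == x)[0][0]) for x in prediction]
code2_total = sum(1 for i in first_match if i != 0)

print("Direct comparison count:", direct_correct)
print("First-match indices:", first_match)
print("Code 2 style count:", code2_total)
```

On these toy values the two counts disagree, which would explain why the two versions print different numbers.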

All the files are available in this folder: https://drive.google.com/drive/folders/15M7-VcUZw7gkLWxlJ8MDKLm6muYIREoT?usp=share_link
