
Running the code below raises `ValueError: All arrays must be of the same length`. I am trying to build a pandas DataFrame so that I can plot it with seaborn's lmplot, but constructing the DataFrame fails.

I tried this:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#file paths

TOKEN_SPAM_PROB_FILE="SpamData/03_Testing/prob-spam.txt"
TOKEN_NONSPAM_PROB_FILE="SpamData/03_Testing/prob-nonspam.txt"
TOKEN_ALL_PROB_FILE="SpamData/03_Testing/prob-all-tokens.txt"

TEST_FEATURE_MATRIX="SpamData/03_Testing/test-features.txt"
TEST_TARGET_FILE="SpamData/03_Testing/test-target.txt"



VOCAB_SIZE=2500

#features
x_test=np.loadtxt(TEST_FEATURE_MATRIX, delimiter=" ")
#target
y_test=np.loadtxt(TEST_TARGET_FILE, delimiter=" ")
#token probabilities
prob_token_spam=np.loadtxt(TOKEN_SPAM_PROB_FILE, delimiter=" ")
prob_token_nonspam=np.loadtxt(TOKEN_NONSPAM_PROB_FILE, delimiter=" ")
prob_all_token=np.loadtxt(TOKEN_ALL_PROB_FILE, delimiter=" ")

PROB_SPAM=0.3116

joint_log_spam=x_test.dot(np.log(prob_token_spam) - np.log(prob_all_token)) + np.log(PROB_SPAM)

joint_log_nonspam=x_test.dot(np.log(prob_token_nonspam) - np.log(prob_all_token)) + np.log(1-PROB_SPAM)


prediction=joint_log_spam > joint_log_nonspam

#simplification

joint_log_spam=x_test.dot(np.log(prob_token_spam)) + np.log(PROB_SPAM)

joint_log_nonspam=x_test.dot(np.log(prob_token_nonspam)) + np.log(1-PROB_SPAM)

correct_doc=[np.where(y_test==x)[0][0] for x in prediction]
# print(correct_doc)
total=0
for i in correct_doc:
    if i!=0:
        total+=1
# np.digitize(y_test, prediction)
print(total)
correct_doc_total=total

correct_docs=correct_doc_total

print("Docs Classified correctly are:", correct_docs)
numbdocs_wrong=x_test.shape[0]-correct_docs
print("Docs classified incorrectly are:", numbdocs_wrong)

fraction_wrong = numbdocs_wrong/len(x_test)
print('Fraction classified incorrectly is {:.2%}'.format(fraction_wrong))
print('Accuracy of the model is {:.2%}'.format(1-fraction_wrong))

#Data Visualisation

yaxis_label = 'P(X | Spam)'
xaxis_label = 'P(X | Nonspam)'

linedata = np.linspace(start=-14000, stop=1, num=1000)

print("The shape of joint_log_spam is:", joint_log_spam.shape)
print("The shape of joint_log_nonspam is:", joint_log_nonspam.shape)
print("The shape of x_test is:", x_test.shape)

# Chart Styling
sns.set_style('whitegrid')
labels = 'Actual Category'

summary_df = pd.DataFrame({yaxis_label:joint_log_spam, xaxis_label:joint_log_nonspam, labels:y_test})

sns.lmplot(x=joint_log_nonspam, y=joint_log_spam, data=summary_df, size=6.5, fit_reg=False,
          scatter_kws={'alpha': 0.5, 's': 25})

plt.xlim([-2000, 1])
plt.ylim([-2000, 1])

plt.plot(linedata, linedata, color='black')

sns.plt.show()

The data folder is available here:

https://drive.google.com/drive/folders/15M7-VcUZw7gkLWxlJ8MDKLm6muYIREoT?usp=sharing

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [23], line 1
----> 1 summary_df = pd.DataFrame({yaxis_label: joint_log_spam, xaxis_label: joint_log_nonspam, labels: y_test})

File ~\anaconda3\envs\py11\Lib\site-packages\pandas\core\frame.py:664, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    658     mgr = self._init_mgr(
    659         data, axes={"index": index, "columns": columns}, dtype=dtype, copy=copy
    660     )
    662 elif isinstance(data, dict):
    663     # GH#38939 de facto copy defaults to False only in non-dict cases
--> 664     mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
    665 elif isinstance(data, ma.MaskedArray):
    666     import numpy.ma.mrecords as mrecords

File ~\anaconda3\envs\py11\Lib\site-packages\pandas\core\internals\construction.py:493, in dict_to_mgr(data, index, columns, dtype, typ, copy)
    489     else:
    490         # dtype check to exclude e.g. range objects, scalars
    491         arrays = [x.copy() if hasattr(x, "dtype") else x for x in arrays]
--> 493 return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)

File ~\anaconda3\envs\py11\Lib\site-packages\pandas\core\internals\construction.py:118, in arrays_to_mgr(arrays, columns, index, dtype, verify_integrity, typ, consolidate)
    115 if verify_integrity:
    116     # figure out the index, if necessary
    117     if index is None:
--> 118         index = _extract_index(arrays)
    119     else:
    120         index = ensure_index(index)

File ~\anaconda3\envs\py11\Lib\site-packages\pandas\core\internals\construction.py:666, in _extract_index(data)
    664 lengths = list(set(raw_lengths))
    665 if len(lengths) > 1:
--> 666     raise ValueError("All arrays must be of the same length")
    668 if have_dicts:
    669     raise ValueError(
    670         "Mixing dicts with non-Series may lead to ambiguous ordering."
    671     )

ValueError: All arrays must be of the same length
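
The traceback ends in `_extract_index`, which raises when the values of the dict passed to `pd.DataFrame` do not all have the same length. The following is a minimal sketch of that failure, a quick length check, and the `pd.concat` workaround mentioned in the comments. The small arrays are made-up stand-ins for `joint_log_spam`, `joint_log_nonspam`, and `y_test`; their sizes are assumptions for illustration, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real arrays; lengths are chosen for illustration only.
joint_log_spam = np.array([-10.0, -20.0, -30.0])
joint_log_nonspam = np.array([-12.0, -18.0, -33.0])
y_test = np.array([1.0, 0.0, 1.0, 0.0, 1.0])  # deliberately longer than the others

data = {
    "P(X | Spam)": joint_log_spam,
    "P(X | Nonspam)": joint_log_nonspam,
    "Actual Category": y_test,
}

# Print the length of every column to see which array is the odd one out.
lengths = {name: len(values) for name, values in data.items()}
print(lengths)

# Passing the dict directly fails because the lengths differ.
try:
    pd.DataFrame(data)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Building one single-column frame per array and concatenating along the
# columns pads the shorter columns with NaN instead of raising.
summary_df = pd.concat(
    [pd.DataFrame(values, columns=[name]) for name, values in data.items()],
    axis=1,
)
print(summary_df.shape)
```

Note that the concatenation only hides the mismatch: rows padded with NaN have no matching label or score. If the lengths differ in the real data, it is probably worth checking that the feature and target files come from the same split before plotting.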
  • 1) `data = {yaxis_label: joint_log_spam, xaxis_label: joint_log_nonspam, labels: y_test}` 2) `summary_df = pd.concat([pd.DataFrame(v, columns=[k]) for k, v in data.items()], axis=1)` 3) `sns.lmplot(x=xaxis_label, y=yaxis_label, data=summary_df, fit_reg=True, scatter_kws={'alpha': 0.5, 's': 25})`. You passed incorrect variables to `x=` and `y=` of the plot call, and there is no `sns.plt.show()`, use `plt.show()`. [Completed Plot](https://i.stack.imgur.com/CG6lk.png). **Tested in `python 3.11`, `pandas 1.5.3`, `matplotlib 3.7.0`, `seaborn 0.12.2`** – Trenton McKinney Mar 10 '23 at 18:47
  • @TrentonMcKinney Thank you for your guidance. I have another question about this code: how can we compare y_test and prediction, i.e., `correct_docs = (y_test == prediction).sum()`? My tutor ran this code successfully, but it is not working for me. – Haseeb Ahmad Mar 11 '23 at 04:29
  • The warning reads "DeprecationWarning: elementwise comparison failed; this will raise an error in the future." – Haseeb Ahmad Mar 11 '23 at 04:39
  • Similarly, the following comparison raises "ValueError: operands could not be broadcast together with shapes (3620,) (1552,) ". The code is `true_pos= (y_test == 1) & (prediction == 1)` – Haseeb Ahmad Mar 11 '23 at 04:57
  • Each StackOverflow post should be a single question, and comments aren’t the place for tacking on more questions or posting answers. In this case, your question was answered by the duplicate so I posted code making it clear how the duplicate answered the question. Please post a new question if you need assistance with other aspects of the code. Make sure you are using current versions of pandas, seaborn, & matplotlib. – Trenton McKinney Mar 11 '23 at 15:19
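
The broadcast error quoted in the comments reports shapes `(3620,)` and `(1552,)`, which suggests that `y_test` and `prediction` do not have the same number of elements. Elementwise comparisons such as `y_test == prediction` only work when the shapes agree. A minimal sketch of the comparison, using made-up arrays of equal length (the values are assumptions for illustration, not the real data):

```python
import numpy as np

# Hypothetical labels and predictions of equal length (illustration only).
y_test = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
prediction = np.array([True, False, False, False, True, True])

# Check the shapes first: a mismatch here is what triggers the
# "operands could not be broadcast together" error.
if y_test.shape != prediction.shape:
    raise ValueError(f"shape mismatch: {y_test.shape} vs {prediction.shape}")

# Count documents where the predicted class equals the actual class.
correct_docs = (y_test == prediction).sum()

# True positives: actually spam and predicted spam.
true_pos = ((y_test == 1) & (prediction == 1)).sum()

# False positives: actually non-spam but predicted spam.
false_pos = ((y_test == 0) & (prediction == 1)).sum()

accuracy = correct_docs / len(y_test)
print(correct_docs, true_pos, false_pos, accuracy)
```

If the shape check fails on the real data, printing `x_test.shape` and `y_test.shape` right after loading should show which file has the unexpected number of rows.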
