I am using the following code to predict the class of an SMS text with Naive Bayes:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(X, Y)

X_test = np.array(['This is a sample sms'], dtype=object)
X_test_transformed = vec.transform(X_test)
X_test = X_transformed.toarray()

proba = mnb.predict_proba(X_test)
print(proba)

I train the model by calling fit on X and Y. Now I want to predict whether the
SMS "This is a sample sms" is spam or not. I am not sure what I am doing wrong,
because the last line should give me a probability, but it gives me the following output:

 [[9.99999987e-01 1.30424974e-08]
 [9.99996703e-01 3.29712871e-06]
 [1.15232279e-22 1.00000000e+00]
 ...
 [9.62666043e-01 3.73339566e-02]
 [9.99984562e-01 1.54382674e-05]
 [9.66244280e-01 3.37557203e-02]]
Juan
  • The probability is 1.0, since the test data has been observed. – wildplasser Aug 01 '21 at 15:37
  • It's not been observed, right? I have created a new `X_test`, which is new data. Even if it's 0, I am not sure why it's returning that two-dimensional matrix and not a single value. – Juan Aug 01 '21 at 15:56
  • I saw something similar in another post: https://stackoverflow.com/questions/36681449/scikit-learn-return-value-of-logisticregression-predict-proba It explains it somewhat, but it is still not entirely clear. Do I have to add up the values in the second column? – Juan Aug 01 '21 at 16:57
  • Note: that was a Bayesian joke. – wildplasser Aug 02 '21 at 00:16

1 Answer

Notice that for each row these two numbers add up to 1. For the first row:

9.99999987e-01 = 9.99999987 * 0.1 = 0.999999987

1.30424974e-08 = 1.30424974 * 0.00000001 ≈ 0.000000013

So the predicted probability that this SMS belongs to class A (which could be either spam or ham, depending on the rest of the code) is about 0.999999987, and the probability that it belongs to class B is about 0.000000013.
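The arithmetic above can be checked with a few lines of plain Python. This is only a sketch; the names `p_a` and `p_b` are made up here, and the two values are copied from the first row of the output in the question.

```python
# First row of the output in the question, written in scientific notation
p_a = float("9.99999987e-01")  # probability of class A
p_b = float("1.30424974e-08")  # probability of class B

print(f"{p_a:.9f}")  # 0.999999987
print(f"{p_b:.9f}")  # 0.000000013

# The printed values are rounded, so the sum is 1 only up to a tiny error
print(abs(p_a + p_b - 1.0) < 1e-8)  # True
```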

In other words, NB predicted class A for that row with a probability close to 1. If, for example, one row of your output matrix were 0.6, 0.4, you would know that NB assigned class A a probability of 0.6 and class B a probability of 0.4. This additional information can be used, for example, to apply your own threshold to the predictions.
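As a rough illustration of thresholding, the sketch below takes three rows copied from the output in the question and flags a row as class B whenever the second column reaches a cut-off. The value 0.03 and the variable names are hypothetical, chosen only to show that a custom threshold can give a different answer than simply picking the larger column.

```python
import numpy as np

# Three rows copied from the output in the question
proba = np.array([
    [9.99999987e-01, 1.30424974e-08],
    [1.15232279e-22, 1.00000000e+00],
    [9.62666043e-01, 3.73339566e-02],
])

# Default behaviour: pick the column with the larger probability
default_class_b = proba[:, 1] > proba[:, 0]
print(default_class_b)  # [False  True False]

# Custom behaviour: flag class B as soon as its probability reaches a cut-off
threshold = 0.03  # hypothetical value, for illustration only
flag_class_b = proba[:, 1] >= threshold
print(flag_class_b)  # [False  True  True]
```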

Edit: If you don't want this score, replace .predict_proba with .predict, which returns only the predicted class label for each row.
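To make the difference concrete, here is a small sketch on a toy corpus. The four training messages, their labels, and the new message are all invented for illustration, and `CountVectorizer` is assumed as the vectorizer because the question does not show how `vec` was built. The columns of `predict_proba` follow the order of `mnb.classes_`, and `predict` returns the label whose column is largest.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration only
texts = ["win a free prize now", "free cash click now",
         "are we meeting for lunch today", "call me when you get home"]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()  # assumed vectorizer; the question does not show it
mnb = MultinomialNB().fit(vec.fit_transform(texts), labels)

new_sms = vec.transform(["free prize waiting, call now"])
proba = mnb.predict_proba(new_sms)  # shape (1, 2): one column per class
label = mnb.predict(new_sms)        # shape (1,): most probable class only

print(mnb.classes_)  # column order used by predict_proba
print(proba)
print(label)
```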

Gaussian Prior
  • But why is the line `proba=mnb.predict_proba(X_test)` returning multiple tuples? There are only two classes, ham and spam, so ideally it should have returned something like `[9.99999987e-01 1.30424974e-08]`. I noticed that the number of rows in the response equals the number of records in the dataset. What could be the reason for this? – Juan Aug 02 '21 at 10:20
  • So the first output `[9.99999987e-01 1.30424974e-08]` means that the probability that the given SMS belongs to Class A is 9.99999987e-01 and to Class B is 1.30424974e-08. What does the next row, `[9.99996703e-01 3.29712871e-06]`, mean? – Juan Aug 02 '21 at 10:21
  • Yeah exactly. The second row implies that the second element of X_test (X_test is a set with every sms which is used to evaluate your model) belongs to class A with probability 9.99996703e-01 and to class B with probability 3.29712871e-06. Similarly, the n-th row is the predicted probabilities for each class for the n-th element of your test set. Keep in mind as you correctly pointed out, you need an answer for every element of your test set, thus if X_test had 100 rows then you would get 100 pairs of numbers in the form of probabilities assigned to predictions. This is normal. – Gaussian Prior Aug 02 '21 at 15:08
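
A note on the row count: judging only from the snippet in the question, `X_test` is overwritten with `X_transformed.toarray()` rather than `X_test_transformed.toarray()`. If `X_transformed` is the vectorized training set, that would explain why the output has as many rows as the dataset. The sketch below reproduces the effect on invented toy data, with `CountVectorizer` assumed as the vectorizer; it shows that `predict_proba` returns one row per input row, so a single transformed SMS yields a single pair of probabilities.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration only
train_texts = np.array([
    "win a free prize now",
    "free cash click now",
    "are we meeting for lunch today",
    "call me when you get home",
], dtype=object)
train_labels = np.array(["spam", "spam", "ham", "ham"])

vec = CountVectorizer()  # assumed vectorizer; the question does not show it
X_transformed = vec.fit_transform(train_texts)  # 4 rows, one per training SMS
mnb = MultinomialNB().fit(X_transformed, train_labels)

X_test = np.array(["This is a sample sms"], dtype=object)
X_test_transformed = vec.transform(X_test)      # 1 row, the new SMS

proba_train = mnb.predict_proba(X_transformed)      # what the question computes
proba_new = mnb.predict_proba(X_test_transformed)   # what was probably intended

print(proba_train.shape)  # (4, 2)
print(proba_new.shape)    # (1, 2)
```

With the single-row input, `proba_new[0]` is the one pair of class probabilities for the new SMS.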