
I have looked at similar questions, such as this one, but none of the suggested solutions worked in my case.

I am trying to build a text classification prediction model.

from sklearn import metrics, naive_bayes

def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)

    if is_neural_net:
        predictions = predictions.argmax(axis=-1)

    return metrics.accuracy_score(predictions, train_label)

# Naive Bayes on Word Level TF IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), train_text, train_label, test_text)

print("NB, WordLevel TF-IDF: ", accuracy)

However, the Naive Bayes model raises the following error:

ValueError: Found input variables with inconsistent numbers of samples: [500, 3100]
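For reference, this is the generic shape-mismatch error from accuracy_score; a minimal reproduction, independent of my actual data, would be:

import numpy as np
from sklearn import metrics

# two label arrays of different lengths -> the same ValueError: [500, 3100]
metrics.accuracy_score(np.zeros(500), np.zeros(3100))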

My training data:

print(train_text.shape)
type(train_text)

returns

(3100, 3013)
scipy.sparse.csr.csr_matrix

My training labels:

print(train_label.shape)
type(train_label)

returns

(3100,)
numpy.ndarray

My test dataset:

print(test_text.shape)
type(test_text)

returns

(500, 3013)
scipy.sparse.csr.csr_matrix

I have tried every type of transformation I could think of. Can anyone recommend a solution? Thanks.


1 Answer


I guess the problem is in:

predictions = classifier.predict(feature_vector_valid)
return metrics.accuracy_score(predictions, train_label)

What is the shape of predictions? Is train_label a global variable inside train_model? Also, does predictions have the same shape as train_label?
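A minimal sketch of the shape check being described here, assuming the train_text, train_label and test_text variables from the question are already defined (test_label is hypothetical and only marks where matching labels would go):

from sklearn import metrics, naive_bayes

classifier = naive_bayes.MultinomialNB()
classifier.fit(train_text, train_label)        # train_text: (3100, 3013), train_label: (3100,)
predictions = classifier.predict(test_text)    # one prediction per test row

print(predictions.shape)     # (500,)
print(train_label.shape)     # (3100,) -> different lengths, hence the ValueError

# accuracy_score needs two arrays of the same length, e.g.
# metrics.accuracy_score(test_label, predictions)   # hypothetical (500,) test labels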

wong.lok.yin
  • The shape of `predictions` is `(500,)`. – leena May 22 '20 at 03:58
  • Yeah, that's the problem. The shape of `train_label` is `(3100,)`. How do you calculate the accuracy of `predictions` and `train_label` when they have different shapes? – wong.lok.yin May 22 '20 at 04:08
  • What do you suggest? – leena May 22 '20 at 05:12
  • You can either use `predictions = classifier.predict(feature_vector_train)`, or use `metrics.accuracy_score(predictions, valid_label)` if your validation set has labels. – wong.lok.yin May 22 '20 at 06:01
  • It worked when I replaced feature_vector_valid with feature_vector_train. But is it the right approach? Because I am following this guide https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/ and that is where I got my code from. Can I see the predicted output? Thank you for your help. – leena May 22 '20 at 07:29
  • Because the predictions should be made based on the test data, not the training data. – leena May 22 '20 at 07:30
  • Oh I see. Of course you should not make predictions based on the training data. But I see you copied the wrong code. It should be `metrics.accuracy_score(predictions, valid_y)`, but you wrote `metrics.accuracy_score(predictions, train_label)`. – wong.lok.yin May 22 '20 at 07:36
  • Because that is the name of my train label; it is different from the one in the link. – leena May 22 '20 at 07:38
  • No, that's not how accuracy works. I think you misunderstand accuracy. Accuracy only works when the two arguments have the same shape. That's why you should compare `predictions` (shape 500) with `valid_y` (shape 500) instead of `train_label` (shape 3100). `metrics.accuracy_score(predictions, valid_y)` is correct; `metrics.accuracy_score(predictions, train_label)` is just wrong. – wong.lok.yin May 22 '20 at 08:17
  • I don't know what the corresponding value to `valid_y` is in my data. – leena May 22 '20 at 08:38
  • `valid_y` is just your test label. You have `train_text,train_label,test_text`, so I guess you have something like `test_label`? If you have it, try `metrics.accuracy_score(predictions, test_label)` – wong.lok.yin May 22 '20 at 08:51
  • I don't have it, because this is a prediction model. `train_text` is 3100 tweets, `train_label` is the gender of those 3100 users, and `test_text` is 500 tweets for which the model is supposed to predict 500 genders. – leena May 22 '20 at 08:58
  • The purpose of the accuracy score is to compare the predictions with the true labels, to see how many predictions are correct. But if you only have predictions and not the true labels, you cannot use the accuracy score (see the sketch after these comments). – wong.lok.yin May 22 '20 at 09:11
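A minimal sketch of the two options discussed in these comments, again assuming the question's variables are defined; test_label is hypothetical and only exists if the true genders of the 500 test tweets are known:

from sklearn import metrics, naive_bayes

classifier = naive_bayes.MultinomialNB().fit(train_text, train_label)

# Option 1: score on the training set (this only measures training accuracy)
print(metrics.accuracy_score(train_label, classifier.predict(train_text)))

# Option 2: score on the test set, which requires true test labels
# (hypothetical test_label of shape (500,))
# print(metrics.accuracy_score(test_label, classifier.predict(test_text)))

# Without test labels, the model can only output its 500 predicted genders
print(classifier.predict(test_text))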