Getting dimension mismatch error when I try to predict with Naive Bayes / Python

Question

I've written a sentiment script that uses Naive Bayes to classify reviews. I trained and tested my model and saved it with pickle. Now I would like to run predictions on a new dataset, but I always get the following error message:

    raise ValueError('dimension mismatch')
ValueError: dimension mismatch

It pops up on this line:

preds = nb.predict(transformed_review)[0]

Can anyone tell me if I'm doing something wrong? I do not understand the error.

This is my script:

sno = SnowballStemmer("german")
stopwords = [word.decode('utf-8-sig') for word in stopwords.words('german')] 

ta_review_files = glob.glob('C:/users/Documents/review?*.CSV')
review_akt_doc = max(ta_review_files, key=os.path.getctime)

ta_review = pd.read_csv(review_akt_doc) 
sentiment_de_class= ta_review

x = sentiment_de_class['REV']
y = sentiment_de_class['SENTIMENT']

def text_process(text):
    # strip punctuation and digits
    nopunc = [char for char in text.decode('utf8') if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    noDig = ''.join(filter(lambda x: not x.isdigit(), nopunc))

    ## stemming (word by word; iterating over noDig directly would stem single characters)
    stemmi = [sno.stem(word) for word in noDig.split()]

    # drop stop words and return the remaining tokens
    return [word for word in stemmi if word.lower() not in stopwords]


######################
# Matrix
######################
bow_transformer = CountVectorizer(analyzer=text_process).fit(x)
x = bow_transformer.transform(x)

######################
# Train and test data
######################
x_train, x_test, y_train, y_test = train_test_split(x,y, random_state=101)


print 'starting training ..'

######################
## first use
######################
#nb = MultinomialNB().fit(x_train,y_train)
#file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'wb')
## dump information to that file
#pickle.dump(nb, file)

######################
## after train
######################
file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'rb')
nb = pickle.load(file)

predis = []
######################
# Classify 
######################
cols = ['SENTIMENT_CLASSIFY']

for sentiment in sentiment_de_class['REV']:
    transformed_review = bow_transformer.transform([sentiment])
    preds = nb.predict(transformed_review)[0]  ##right here I get the error
    predis.append(preds)

df = pd.DataFrame(predis, columns=cols)

What dimensionality (shape) does transformed_review have? Make sure you are passing [n_samples, n_features] to the Naive Bayes model. Ref: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict — Ari K, Apr 24 '18 at 18:10
The transformed review prints this output (one entry per line): (0, 8) 1 (0, 11) 1 (0, 26) 1 (0, 39) 1 — Nika, Apr 24 '18 at 18:13
what is the result of `x_train.shape` and `transformed_review.shape`? — Mohammad Athar, Apr 24 '18 at 18:16
x_train.shape (5, 129) and transformed_review.shape (1, 129) — Nika, Apr 24 '18 at 18:20
Try not transforming. Unless you have 1 sample it sounds like transformed_review is not the right shape for the naive bayes — Ari K, Apr 24 '18 at 18:28
If I use preds = nb.predict(sentiment), then I get a totally different error. — Nika, Apr 24 '18 at 18:37
I just retrained the model with the current dataset and my loop worked. The shape information is the same as before. Does anyone know what this could mean? It seems I have to retrain my model to make it work. — Nika, Apr 24 '18 at 18:49
@VivekKumar Sorry for the late reply. Yes, it works, thanks a lot! — Nika, May 02 '18 at 11:23

Vivek Kumar · Accepted Answer · 2018-04-25T07:02:01.783

You need to save the CountVectorizer object too, just as you are saving nb.

When you call

CountVectorizer(analyzer=text_process).fit(x)

you are re-fitting the CountVectorizer on the new data, so the features (vocabulary) it finds will differ from those found at training time. The saved nb, which was trained on the earlier features, therefore complains about a dimension mismatch.
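To see why re-fitting changes the feature count, here is a toy sketch in plain Python 3. The helpers build_vocabulary and to_counts are hypothetical stand-ins for what fit and transform do conceptually, not scikit-learn's implementation, and the review strings are made up for illustration.

```python
def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index (toy stand-in for fit)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_counts(text, vocabulary):
    """Return one count per vocabulary entry (toy stand-in for transform)."""
    row = [0] * len(vocabulary)
    for tok in text.lower().split():
        if tok in vocabulary:  # tokens unseen at fit time are ignored
            row[vocabulary[tok]] += 1
    return row


# Made-up data: what the model was trained on vs. what arrives later.
train_reviews = ["gutes hotel", "schlechtes essen", "gutes essen"]
new_reviews = ["sehr gutes hotel mit pool", "zimmer war schmutzig"]

train_vocab = build_vocabulary(train_reviews)  # vocabulary the saved model expects
refit_vocab = build_vocabulary(new_reviews)    # vocabulary after re-fitting on new data

review = "sehr gutes hotel"
print(len(to_counts(review, train_vocab)))  # width matching the saved model
print(len(to_counts(review, refit_vocab)))  # different width: the mismatch
```

A model trained on vectors of the first width cannot score vectors of the second, which is the situation the error message describes. Even when the two widths happen to be equal, the columns would stand for different words, so the predictions would be meaningless.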

It is better to pickle them in separate files, but if you want, you can save them in the same file.

To pickle both in the same file:

file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'wb')
pickle.dump(bow_transformer, file)  # <=== add this
pickle.dump(nb, file)

To read both back on the next run (in the same order they were dumped):

file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'rb')
bow_transformer = pickle.load(file)
nb = pickle.load(file)
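The same round trip can be sketched in Python 3 with `with` blocks, so the file is closed even if an error occurs. This is only an illustration: the two dictionaries are hypothetical stand-ins for the fitted bow_transformer and nb, and the temporary directory replaces the asker's sentiment_MNB_path.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the fitted vectorizer and the trained classifier.
vectorizer_stand_in = {"vocabulary": {"gutes": 0, "hotel": 1}}
model_stand_in = {"class_labels": ["neg", "pos"]}

# Assumed location; the original script uses sentiment_MNB_path instead.
model_path = os.path.join(tempfile.mkdtemp(), "sentiment_MNB_model.pickle")

# Dump both objects into one file, one after the other.
with open(model_path, "wb") as fh:
    pickle.dump(vectorizer_stand_in, fh)
    pickle.dump(model_stand_in, fh)

# Load them back in exactly the order they were written.
with open(model_path, "rb") as fh:
    loaded_vectorizer = pickle.load(fh)
    loaded_model = pickle.load(fh)

print(loaded_vectorizer == vectorizer_stand_in, loaded_model == model_stand_in)
```

Each pickle.load call reads the next object from the stream, so swapping the two load lines would silently assign the classifier to the vectorizer's name and vice versa.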

Please look at this answer for more detail: https://stackoverflow.com/a/15463472/3374996

Do I pickle the vectorizer in a separate file, or do I put it in sentiment_MNB_model.pickle? — Nika, Apr 25 '18 at 06:48
