I've written a sentiment analysis script that uses Naive Bayes to classify reviews. I trained and tested the model and saved it with pickle. Now I would like to run predictions on a new dataset, but I always get the following error message:

raise ValueError('dimension mismatch')
ValueError: dimension mismatch

It pops up on this line:

preds = nb.predict(transformed_review)[0]

Can anyone tell me if I'm doing something wrong? I do not understand the error.

This is my script:

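import glob
import os
import pickle
import string

import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

sno = SnowballStemmer("german")
stopwords = [word.decode('utf-8-sig') for word in stopwords.words('german')]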

ta_review_files = glob.glob('C:/users/Documents/review?*.CSV')
review_akt_doc = max(ta_review_files, key=os.path.getctime)

ta_review = pd.read_csv(review_akt_doc) 
sentiment_de_class= ta_review

x = sentiment_de_class['REV']
y = sentiment_de_class['SENTIMENT']

def text_process(text):
    nopunc = [char for char in text.decode('utf8') if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    noDig = ''.join(filter(lambda x: not x.isdigit(), nopunc)) 

    ## stemming: stem each word, not each character
    stemmi = u' '.join(sno.stem(unicode(word)) for word in noDig.split())

    # return the remaining tokens after stop word removal
    return [word for word in stemmi.split() if word.lower() not in stopwords]


######################
# Matrix
######################
bow_transformer = CountVectorizer(analyzer=text_process).fit(x)
x = bow_transformer.transform(x)

######################
# Train and test data
######################
x_train, x_test, y_train, y_test = train_test_split(x,y, random_state=101)


print 'starting training ..'

######################
## first use
######################
#nb = MultinomialNB().fit(x_train,y_train)
#file = open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'wb')
## dump information to that file
#pickle.dump(nb, file)

######################
## after train
######################
with open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'rb') as file:
    nb = pickle.load(file)

predis = []
######################
# Classify 
######################
cols = ['SENTIMENT_CLASSIFY']

for sentiment in sentiment_de_class['REV']:
    transformed_review = bow_transformer.transform([sentiment])
    preds = nb.predict(transformed_review)[0]  # the error is raised here
    predis.append(preds)

df = pd.DataFrame(predis, columns=cols)
Vivek Kumar
Nika
  • What dimensionality (shape) does transformed review have? You can make certain you're passing in [n_samples, n_features] to the naive bayes. Ref: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict – Ari K Apr 24 '18 at 18:10
  • The transformed review has this output (printed one entry per line): (0, 8) 1 (0, 11) 1 (0, 26) 1 (0, 39) 1 – Nika Apr 24 '18 at 18:13
  • Ah, Is it a list? – Ari K Apr 24 '18 at 18:14
  • What is the result of `x_train.shape` and `transformed_review.shape`? – Mohammad Athar Apr 24 '18 at 18:16
  • x_train.shape (5, 129) and transformed_review.shape (1, 129) – Nika Apr 24 '18 at 18:20
  • Try not transforming. Unless you have 1 sample it sounds like transformed_review is not the right shape for the naive bayes – Ari K Apr 24 '18 at 18:28
  • If I use preds = nb.predict(sentiment), then I get a totally different error – Nika Apr 24 '18 at 18:37
  • I just retrained the model with the current dataset and my loop worked. The shape information is the same. Does anyone know what this could mean? It seems like I have to retrain my model to make it work – Nika Apr 24 '18 at 18:49
  • Is your problem solved? – Vivek Kumar May 02 '18 at 09:58
  • @VivekKumar Sorry for the late reply. Yes, it works, thanks a lot! – Nika May 02 '18 at 11:23

1 Answer

You need to save the fitted CountVectorizer object too, just as you are saving nb.

When you call

CountVectorizer(analyzer=text_process).fit(x)

you are re-fitting the CountVectorizer on the new data, so the features (vocabulary) it finds will differ from those found at training time. The saved nb, which was trained on the earlier features, therefore complains about a dimension mismatch.
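To make the mismatch concrete, here is a minimal sketch in Python 3. The toy documents and labels are invented for illustration and are not from the question, and the default analyzer is used instead of `text_process`. It shows that a vectorizer fitted on different data produces a different number of columns, which the already-trained classifier rejects:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration only.
train_docs = ["gutes hotel", "schlechtes essen", "gutes essen"]
train_labels = ["pos", "neg", "pos"]
new_docs = ["sehr freundliches personal", "zimmer war sauber und ruhig"]

# Vectorizer and classifier as they existed at training time.
vec_train = CountVectorizer().fit(train_docs)
nb = MultinomialNB().fit(vec_train.transform(train_docs), train_labels)

# Re-fitting on the new data learns a different vocabulary.
vec_new = CountVectorizer().fit(new_docs)

X_ok = vec_train.transform(new_docs)  # columns match the trained model
X_bad = vec_new.transform(new_docs)   # columns come from the new vocabulary

preds = nb.predict(X_ok)  # works: same feature space as during training

try:
    nb.predict(X_bad)
    mismatch_raised = False
except ValueError:
    # The exact wording of the message depends on the scikit-learn version.
    mismatch_raised = True

# Number of features the classifier was trained with.
n_trained_features = nb.feature_count_.shape[1]
print(X_ok.shape, X_bad.shape, n_trained_features, mismatch_raised)
```

The number to compare against is the one the pickled classifier was trained with (`nb.feature_count_.shape[1]`), not the shape of an `x_train` built from the re-fitted vectorizer in the same run.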

It is better to pickle them in separate files, but if you want, you can save them in the same file.
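The separate-file variant could look roughly like this. It is a sketch in Python 3 with invented toy data; the temporary directory and the file name `sentiment_vectorizer.pickle` are hypothetical placeholders, and the default analyzer stands in for `text_process`:

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration only.
docs = ["gutes hotel", "schlechtes essen", "gutes essen"]
labels = ["pos", "neg", "pos"]

bow_transformer = CountVectorizer().fit(docs)
nb = MultinomialNB().fit(bow_transformer.transform(docs), labels)

# Hypothetical location and file names; a temp dir keeps the sketch self-contained.
model_dir = tempfile.mkdtemp()
vec_path = os.path.join(model_dir, "sentiment_vectorizer.pickle")
nb_path = os.path.join(model_dir, "sentiment_MNB_model.pickle")

# Training run: save both fitted objects.
with open(vec_path, "wb") as f:
    pickle.dump(bow_transformer, f)
with open(nb_path, "wb") as f:
    pickle.dump(nb, f)

# Prediction run: load both and only call transform(), never fit().
with open(vec_path, "rb") as f:
    loaded_vec = pickle.load(f)
with open(nb_path, "rb") as f:
    loaded_nb = pickle.load(f)

preds = loaded_nb.predict(loaded_vec.transform(["gutes zimmer"]))
print(preds)
```

One caveat when the vectorizer uses a custom analyzer such as `text_process`: pickle stores functions by reference, so the function has to be defined or importable under the same name in the script that loads the vectorizer.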

To pickle both in the same file:

with open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'wb') as file:
    pickle.dump(bow_transformer, file)  # <== add this
    pickle.dump(nb, file)

To read both in the next run:

with open(sentiment_MNB_path + 'sentiment_MNB_model.pickle', 'rb') as file:
    # load in the same order in which the objects were dumped
    bow_transformer = pickle.load(file)
    nb = pickle.load(file)
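The reason the load order matters is that consecutive `pickle.dump` calls append one object after another to the same stream, and each `pickle.load` reads the next one. A small standard-library sketch, with placeholder dictionaries standing in for the real vectorizer and classifier:

```python
import io
import pickle

buffer = io.BytesIO()  # stands in for the opened pickle file

# Placeholder objects, not real models.
pickle.dump({"name": "vectorizer"}, buffer)  # written first
pickle.dump({"name": "classifier"}, buffer)  # written second

buffer.seek(0)  # rewind, as if the file had been reopened for reading

first = pickle.load(buffer)   # returns the object dumped first
second = pickle.load(buffer)  # returns the object dumped second

try:
    pickle.load(buffer)
    reached_end = False
except EOFError:
    # Nothing is left in the stream after the two objects.
    reached_end = True
```

Swapping the two `pickle.load` calls would silently assign the classifier to `bow_transformer` and the vectorizer to `nb`.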

Please look at this answer for more detail: https://stackoverflow.com/a/15463472/3374996
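A related option, which is not what the question's script does, is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` so that a single pickled object always carries the matching vocabulary. A minimal sketch in Python 3, again with invented toy data and the default analyzer:

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
docs = ["gutes hotel", "schlechtes essen", "gutes essen"]
labels = ["pos", "neg", "pos"]

# The pipeline fits the vectorizer and the classifier together on raw text.
model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(docs, labels)

# Round trip through pickle; in practice this would be a file on disk.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# Raw text goes in; the stored vectorizer is applied internally.
new_docs = ["gutes zimmer", "schlechtes personal"]
preds = restored.predict(new_docs)
print(list(preds))
```

With this layout there is no separate `transform` step at prediction time, so the vectorizer cannot be re-fitted by accident.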

Vivek Kumar
  • Do I pickle the vectorizer in a separate file, or do I put it in sentiment_MNB_model.pickle? – Nika Apr 25 '18 at 06:48