Error predicting: X has n features per sample, expecting m

Question

I got the following code, where I transform a text to tf:

...
x_train, x_test, y_train, y_test = model_selection.train_test_split(dataset['documents'],dataset['classes'],test_size=test_percentil)
#Term document matrix
count_vect = CountVectorizer(ngram_range=(1, Ngram), min_df=1, max_features=MaxVocabulary)
x_train_counts = count_vect.fit_transform(x_train)
x_test_counts=count_vect.transform(x_test)
#Term Inverse-Frequency
tf_transformer = TfidfTransformer(use_idf=True).fit(x_train_counts)
lista=tf_transformer.get_params()
x_train_tf = tf_transformer.transform(x_train_counts)
x_test_tf=tf_transformer.transform(x_test_counts)
...

Then, I train a model and save it using pickle. The problem comes when, in another program, I try to predict new data. Basically, I got:

count_vect = CountVectorizer(ngram_range=(1, 1), min_df=1, max_features=None)
x_counts = count_vect.fit_transform(dataset['documents'])

#Term Inverse-Frequency
tf_transformer = TfidfTransformer(use_idf=True).fit(x_counts)
x_tf = tf_transformer.transform(x_train_counts)

model.predict(x_tf)

When I execute this code, the output is

ValueError: X has 8933 features per sample; expecting 7488

I know this is a problem with the TfIdf representation, and I hear that I need to use the same tf_transformer and vectorizer to get the expected input shape, but I don't know how to achieve this. I can store the others transformers and vectorizers, but I have tried using different combinations and I got nothing.

why do you initizalize again the `count_vect` and `tf_transformer` ? — J. Doe, Jul 06 '18 at 10:18
can't you save and load the TFidf result with `pickle` or `joblib`? — J. Doe, Jul 06 '18 at 10:22
is it helping https://stackoverflow.com/questions/29788047/keep-tfidf-result-for-predicting-new-content-using-scikit-for-python — J. Doe, Jul 06 '18 at 10:23

score 1 · Accepted Answer · answered Jul 06 '18 at 10:36

1

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.externals import joblib
a = pd.Series(["hello, this is me", "hello this is me too"])
b = pd.Series(["hello, this is John", "hi it's Doe"])
tfidf = TfidfVectorizer().fit(a)
joblib.dump(tfidf, 'path_to/tfidf.pkl')
tfidf = joblib.load('path_to/tfidf.pkl')
tfidf.transform(b).todense()

answered Jul 06 '18 at 10:36

J. Doe

3,458
2
24
42

used fit_transform on the tfidfVectorizer() instead of fit. what to do next? – Led Sep 15 '21 at 07:35

Vivek Kumar · Answer 2 · 2018-07-06T11:45:45.477

In another program you are instantiating a new object, which will not know that previous data has those many columns.

You need to save the CountVectorizer and TfidfTransformer the same way as you saved the model and load them the same way in another program.

Also, you can just use the TfidfVectorizer instead of CountVectorizer + TfidfTransformer, because it does the combined thing and will make your work (saving and loading them easier).

So during training do this:

...
x_train, x_test, y_train, y_test = model_selection.train_test_split(dataset['documents'],dataset['classes'],test_size=test_percentil)
#Term document matrix
tf_vect = TfidfVectorizer(ngram_range=(1, Ngram), min_df=1, max_features=MaxVocabulary, use_idf=True)
x_train_tf = tf_vect.fit_transform(x_train)
x_test_tf = tf_vect.transform(x_test)

...

Error predicting: X has n features per sample, expecting m

2 Answers2

Linked