
Below is the code I am trying for a text classification model:

from sklearn.feature_extraction.text import TfidfVectorizer
ifidf_vectorizer = TfidfVectorizer()

X_train_tfidf = ifidf_vectorizer.fit_transform(X_train)
X_train_tfidf.shape

(3, 16)

from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

So far, only the training set has been vectorized into a full vocabulary. In order to perform analysis on the test set, I need to put it through the same procedure. So I did:

X_test_tfidf = ifidf_vectorizer.fit_transform(X_test) 
X_test_tfidf.shape
(2, 12)

And finally, when trying to predict, it shows an error:

predictions = clf.predict(X_test_tfidf)

ValueError: X has 12 features per sample; expecting 16

But when I use a Pipeline (from sklearn.pipeline import Pipeline), it works fine.

Can't I code it the way I was trying?

Sachin84

2 Answers


The error is in the fit_transform of the test data. You fit_transform the training data and only transform the test data:

# change this
X_test_tfidf = ifidf_vectorizer.fit_transform(X_test) 
X_test_tfidf.shape
(2, 12)

# to
X_test_tfidf = ifidf_vectorizer.transform(X_test)
X_test_tfidf.shape
(2, 16)

Reason: when you call fit_transform, the vectorizer first learns the vocabulary (fit) and then uses it to transform the data. You use the training data to learn the vocabulary, then apply it to both train and test with transform.

If you call fit_transform on the test data, you discard the vocabulary learned from the training data and replace it with one learned from the test data. Given that your test set is smaller than your training set, you will most likely end up with two different vectorizations; here 16 features versus 12.
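A minimal sketch with made-up toy sentences (not the asker's data) shows the mismatch:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["the cat sat on the mat", "dogs chase cats"]
test_texts = ["a bird flew by"]

vec = TfidfVectorizer()
Xtr = vec.fit_transform(train_texts)       # learns the vocabulary from train
Xte_same = vec.transform(test_texts)       # reuses it: same number of columns
Xte_refit = vec.fit_transform(test_texts)  # relearns a new, smaller vocabulary

print(Xtr.shape[1], Xte_same.shape[1], Xte_refit.shape[1])
```

With these toy sentences, Xtr and Xte_same share the same column count, while Xte_refit has fewer columns, which is exactly the shape mismatch the classifier complains about.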

A better way: the cleanest way to do this is to use a Pipeline, which makes your flow easy to understand:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline


clf = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer()),
    ('model', LinearSVC()),
])

# train
clf.fit(X_train,y_train)

# predict
clf.predict(X_test)

This is easier because the transformations are taken care of for you. You don't have to worry about calling fit_transform when fitting the model, or transform when predicting or scoring.

You can access the steps independently if you wish, with:


clf.named_steps['vectorizer']  # or 'model'

Under the hood, when you call clf.fit, your data passes through the vectorizer via fit_transform and then on to the model. When you predict or score, your data passes through the vectorizer with transform before reaching the model.
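That under-the-hood behaviour can be sketched as a tiny hand-rolled pipeline (a simplified stand-in with hypothetical toy data, not scikit-learn's actual implementation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

class TwoStepPipeline:
    """Simplified stand-in for Pipeline([('vectorizer', ...), ('model', ...)])."""

    def __init__(self, vectorizer, model):
        self.vectorizer = vectorizer
        self.model = model

    def fit(self, X, y):
        Xt = self.vectorizer.fit_transform(X)  # fit_transform on the raw text
        self.model.fit(Xt, y)                  # then fit the estimator
        return self

    def predict(self, X):
        Xt = self.vectorizer.transform(X)      # transform only: no refitting
        return self.model.predict(Xt)

pipe = TwoStepPipeline(TfidfVectorizer(), LinearSVC())
pipe.fit(["good great movie", "bad awful film", "great film"], [1, 0, 1])
preds = pipe.predict(["awful movie"])
```

Because predict only ever calls transform, the test text is mapped into the same feature space the model was trained on.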

Prayson W. Daniel
  • Thank you Prayson W. Daniel. I understood the issue. I tried both ways: a separate transform for X_test, and a sklearn pipeline. However, it leads to a question: the model uses only the word vectors/vocabulary of the train data, so any word in the test data that is not part of the vocab will be missed, isn't it? – Sachin84 Jul 17 '20 at 06:35
  • Yes. Your initial model was trained on test-data tokens, because you replaced the train vocabulary with the test vocabulary when you called fit_transform on your test data. Unless you are extremely lucky and the train and test datasets contain similar tokens, you are most likely going to end up with a wrong-shape vectorizer ;) We usually don't use the test data to train the model, but to evaluate how the model would perform on unseen data. At the end, you can train the model on all your data. – Prayson W. Daniel Jul 17 '20 at 06:55

Your code fails because you refit the vectorizer with .fit_transform() on the test set X_test. Instead, you should only transform the test data with the already-fitted vectorizer:

X_test_tfidf = ifidf_vectorizer.transform(X_test) 

Now it should work as expected. You fit the ifidf_vectorizer on X_train only and transform all data with that fitted vocabulary. This ensures the same vocabulary is used and that the outputs have the same number of features.
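Putting it together, the original non-pipeline approach works once the test set is only transformed (the placeholder texts below stand in for the asker's data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X_train = ["spam offer click now", "meeting at noon", "free offer click"]
y_train = [1, 0, 1]
X_test = ["free spam offer", "noon meeting"]

ifidf_vectorizer = TfidfVectorizer()
X_train_tfidf = ifidf_vectorizer.fit_transform(X_train)  # fit on train only
X_test_tfidf = ifidf_vectorizer.transform(X_test)        # reuse that vocabulary

clf = LinearSVC()
clf.fit(X_train_tfidf, y_train)
predictions = clf.predict(X_test_tfidf)  # no ValueError: column counts match
```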

afsharov