I am doing text classification in Python, and I want to use the model in a production environment to make predictions on new documents. I am using TfidfVectorizer to build the bag-of-words representation.
I fit it on the training data like this:
X_train = vectorizer.fit_transform(clean_documents_for_train, classLabel).toarray()
Then I run cross-validation, train the model using an SVM, and save the model to disk.
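For context, the training script looks roughly like the sketch below. The toy documents and labels, the choice of SVC with a linear kernel, cv=3, pickle as the serializer, and the file name model.pkl are all placeholders for illustration, not my real setup.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data standing in for the real (much larger) training set.
clean_documents_for_train = [
    "cheap pills buy now",
    "win money fast",
    "free prize claim now",
    "meeting at noon tomorrow",
    "project report attached",
    "lunch with the team",
]
classLabel = [1, 1, 1, 0, 0, 0]

# Learn the vocabulary and IDF weights, then build the TF-IDF matrix.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(clean_documents_for_train).toarray()

# Cross-validate, then train the final model on all training data.
model = SVC(kernel="linear")
scores = cross_val_score(model, X_train, classLabel, cv=3)
model.fit(X_train, classLabel)

# Only the classifier is saved here; the fitted vectorizer is not.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```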
To make predictions on my test data, I load that model in another script, where I create the same TfidfVectorizer. I know I can't call fit_transform on the test data; I have to call transform instead:
X_test = vectorizer.transform(clean_test_documents).toarray()
But this fails, because the vectorizer in the new script has not been fitted. I know one workaround: I can load my training data and call fit_transform again,
as I did when building the model. But my training data is very large, and I can't do that every time I want to predict. So my questions are:
- Is there a way to use TfidfVectorizer on my test data and make predictions without re-fitting it on the training data?
- Is there any other way to make predictions?
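
To make the problem concrete, this is a minimal sketch of what happens in the prediction script. The two test documents are placeholders, and I am assuming a scikit-learn version in which the unfitted vectorizer raises NotFittedError.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder test documents; the real ones arrive in production.
clean_test_documents = ["free money now", "see you at the meeting"]

# A fresh vectorizer with the same settings, but never fitted in this script.
vectorizer = TfidfVectorizer()

try:
    X_test = vectorizer.transform(clean_test_documents).toarray()
    failed = False
except NotFittedError as exc:
    # transform() needs the vocabulary and IDF weights learned during fit.
    failed = True
    print("transform failed:", exc)
```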