I want to save a text classifier. I am using a TfidfVectorizer in my code like this:
vectorizer = TfidfVectorizer(analyzer='word', tokenizer=tokenize, lowercase=True, stop_words='english',
max_features=1100)
corpus_data_features = vectorizer.fit_transform(train_data_df.Text.tolist() + test_data_df.Text.tolist())
# Converting the document-term matrix to a NumPy ndarray
corpus_data_features_nd = corpus_data_features.toarray()
calibrated_svc.fit(X=corpus_data_features_nd[0:len(train_data_df)], y=train_data_df.Domain)
test_pred = calibrated_svc.predict(corpus_data_features_nd[len(train_data_df):])
So after training I can save the model and reuse it. But when I want to reuse the model, I must create corpus_data_features again:
corpus_data_features = vectorizer.fit_transform(train_data_df.Text.tolist() + test_data_df.Text.tolist())
Saving the classifier this way does not speed up classification, because the vectorizer is refit every time. How can I split corpus_data_features into two parts, save the fitted vectorizer for train_data_df, and then only transform test_data_df when I load my model?
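In other words, I am after something like the following minimal sketch. The sample texts, labels, and file names are placeholders (my real data lives in train_data_df/test_data_df), joblib persistence is an assumption, and my custom tokenize function is omitted for brevity:

```python
import os
import tempfile
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# Placeholder training data standing in for train_data_df.Text / train_data_df.Domain
train_texts = ["cheap offer now", "limited offer cheap", "free offer today", "cheap deal offer",
               "meeting agenda notes", "project meeting today", "notes from the meeting",
               "agenda for project"]
train_labels = ["spam", "spam", "spam", "spam", "work", "work", "work", "work"]

# Fit the vectorizer on the TRAINING text only (custom tokenize omitted here)
vectorizer = TfidfVectorizer(analyzer='word', lowercase=True,
                             stop_words='english', max_features=1100)
X_train = vectorizer.fit_transform(train_texts)

calibrated_svc = CalibratedClassifierCV(LinearSVC(), cv=2)
calibrated_svc.fit(X_train, train_labels)

# Persist BOTH the fitted vectorizer (it carries the learned vocabulary and
# idf weights) and the classifier, so neither needs refitting later
model_dir = tempfile.mkdtemp()
joblib.dump(vectorizer, os.path.join(model_dir, "vectorizer.joblib"))
joblib.dump(calibrated_svc, os.path.join(model_dir, "classifier.joblib"))

# Later, in a separate run: load both and call transform, never fit_transform
vec = joblib.load(os.path.join(model_dir, "vectorizer.joblib"))
clf = joblib.load(os.path.join(model_dir, "classifier.joblib"))
X_test = vec.transform(["cheap free offer", "agenda for the meeting"])
test_pred = clf.predict(X_test)
```

The key point is that `transform` reuses the vocabulary learned during `fit_transform`, so test documents are mapped into the same feature space as the training data without touching it again.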