Python Predict Features from Trained Set

Question

I'm trying to predict some features from trained data, but I'm having trouble with Python and want to make sure I'm on the right path.

My first Python file looks like this:

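import joblib
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

dataset = pandas.read_csv('/root/Desktop/data.csv', encoding='cp1252')
test_size = 0.2

X_train_raw, X_test_raw, y_train, y_test = train_test_split(dataset['text'], dataset['age'], test_size=test_size)

# Fit the TF-IDF vectorizer on the training text and train the classifier
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
classifier = LogisticRegression()
svm_ = classifier.fit(X_train, y_train)

# Save the trained classifier to disk
save = joblib.dump(svm_, 'myfile.pkl')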

My second Python file looks like this:

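import joblib
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer

datasetforprediction = pandas.read_csv('/root/Desktop/predict.csv', encoding='cp1252')

# Load the classifier saved by the first file
load = joblib.load('myfile.pkl')
vectorizer = TfidfVectorizer()
Test = vectorizer.fit_transform(datasetforprediction['text'])

x = load.predict(Test)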

Error: ValueError: X has 505 features per sample; expecting 18063

score 1 · Answer 1 · answered Nov 24 '19 at 17:49

Your training and prediction (test) sets have different dimensions. To solve this, save the vocabulary_ while training and reuse that same vocabulary_ when you predict:

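from sklearn.feature_extraction.text import TfidfVectorizer

# training script: fit the vectorizer and keep its vocabulary
vectorizer = TfidfVectorizer(min_df=2)
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
vocab = vectorizer.vocabulary_  # save this to disk

# later, in another script, after loading vocab from disk
# (a fixed vocabulary keeps the columns identical to training;
# fit_transform here re-learns the IDF weights from the new texts)
vectorizer = TfidfVectorizer(min_df=2, vocabulary=vocab)
titles_tfidf = vectorizer.fit_transform(titles)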

You may also refer to keep-tfidf-result-for-predicting-new-content-using-scikit-for-python and tfidfvectorizer-how-does-the-vectorizer-with-fixed-vocab-deal-with-new-words.
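
The "loading the vocab from disk" step is not shown above, so here is a minimal sketch of one way it might look. The toy texts, the vocab.pkl file name, and the use of pickle are all assumptions made for illustration; they are not part of the original answer.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for the training and prediction texts.
train_texts = ["the cat sat", "the dog barked", "a cat and a dog"]
new_texts = ["the bird sang", "a dog ran"]

# Training script: fit the vectorizer and persist its vocabulary.
train_vectorizer = TfidfVectorizer()
X_train = train_vectorizer.fit_transform(train_texts)
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.pkl")  # hypothetical path
with open(vocab_path, "wb") as f:
    pickle.dump(train_vectorizer.vocabulary_, f)

# Prediction script: reload the vocabulary and pass it to a new vectorizer.
with open(vocab_path, "rb") as f:
    vocab = pickle.load(f)
pred_vectorizer = TfidfVectorizer(vocabulary=vocab)
X_new = pred_vectorizer.fit_transform(new_texts)

# Both matrices now have one column per training-vocabulary term.
same_width = X_new.shape[1] == X_train.shape[1]
```

Words in new_texts that are missing from the saved vocabulary (such as "bird") simply get no column, so the width always matches what the model was trained on.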

emremrah · Accepted Answer · 2019-11-24T18:07:59.440

Assuming your data.csv does not fully contain predict.csv, you are fitting one vectorizer (say vectorizer1) on the training data and transforming that data with it. After that, you fit another, completely new vectorizer (say vectorizer2) on the prediction data and transform that data with it. Because the two datasets are not the same, vectorizer1 is not equal to vectorizer2: they are fitted on different data, so they learn different vocabularies. The model raises an error because it was trained on the 18063 features produced by vectorizer1, while vectorizer2 produces only 505.
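
To make the mismatch concrete, here is a minimal sketch on hypothetical toy data (the texts and labels are invented for illustration and are not taken from data.csv or predict.csv). It fits two separate vectorizers, as the question's two files do, and checks that the model rejects the second one's output.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for data.csv and predict.csv.
train_texts = [
    "young people love loud music",
    "students cram for final exams",
    "retired couples enjoy quiet gardens",
    "grandparents visit on sundays",
]
train_labels = ["young", "young", "old", "old"]
predict_texts = ["loud music tonight", "quiet gardens nearby"]

# vectorizer1 learns its vocabulary from the training texts.
vectorizer1 = TfidfVectorizer()
X_train = vectorizer1.fit_transform(train_texts)
model = LogisticRegression().fit(X_train, train_labels)

# vectorizer2 learns a different, smaller vocabulary from the prediction texts.
vectorizer2 = TfidfVectorizer()
X_wrong = vectorizer2.fit_transform(predict_texts)

# The model expects X_train's column count, so predicting on X_wrong fails.
try:
    model.predict(X_wrong)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(X_train.shape[1], X_wrong.shape[1], type(mismatch_error).__name__)
```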

What you should do is:

1. Merge the train and predict data.
2. Fit a vectorizer (fit only) on that merged, full data.
3. Transform your train data and train a model.
4. Save the vectorizer just as you do for the svm_ model (you can use pickle).
5. Load the vectorizer just as you do for the svm_ model.
6. Transform your predict data and predict on it.
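
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the toy texts and labels, the file names vectorizer.pkl and model.pkl, and the choice of joblib for both objects are hypothetical, and the two scripts are shown in one block for brevity.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for data.csv and predict.csv.
train_texts = [
    "young people love loud music",
    "students cram for final exams",
    "retired couples enjoy quiet gardens",
    "grandparents visit on sundays",
]
train_labels = ["young", "young", "old", "old"]
predict_texts = ["loud music tonight", "quiet gardens nearby"]

workdir = tempfile.mkdtemp()
vectorizer_path = os.path.join(workdir, "vectorizer.pkl")  # hypothetical name
model_path = os.path.join(workdir, "model.pkl")  # hypothetical name

# Merge the texts and fit (only fit) the vectorizer on the full set.
vectorizer = TfidfVectorizer()
vectorizer.fit(train_texts + predict_texts)

# Transform the training texts and train the model.
X_train = vectorizer.transform(train_texts)
model = LogisticRegression().fit(X_train, train_labels)

# Save the vectorizer alongside the model.
joblib.dump(vectorizer, vectorizer_path)
joblib.dump(model, model_path)

# In the prediction script: load both objects back.
loaded_vectorizer = joblib.load(vectorizer_path)
loaded_model = joblib.load(model_path)

# Transform (do not fit again) the prediction texts, then predict.
X_predict = loaded_vectorizer.transform(predict_texts)
predictions = loaded_model.predict(X_predict)
```

The important detail is that the prediction side calls transform on the loaded vectorizer instead of fitting a new one, so X_predict has the same columns the model was trained on.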

Why do you need to merge the data and fit the vectorizer on all of it? Because some of the words in your prediction set may not be in the train set. If you fit the vectorizer ONLY on the train set, then when it encounters unseen words in the prediction set, it will not transform them properly: they are ignored. That's why you need to fit your vectorizer on the full data.
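
As a small illustration of how a fitted vectorizer treats words it has never seen, the sketch below uses hypothetical toy sentences (not the asker's data). It fits on two training texts and then transforms one text made of known words and one made only of unknown words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training texts; the vocabulary is learned from these only.
train_texts = ["the cat sat", "the dog barked"]
vectorizer = TfidfVectorizer().fit(train_texts)

# Every word here was seen during fitting, so each gets a non-zero weight.
seen = vectorizer.transform(["the cat barked"])

# None of these words were seen during fitting, so the row stays empty.
unseen = vectorizer.transform(["zebras gallop"])

print(seen.nnz, unseen.nnz, unseen.shape)
```

No error is raised for the unseen words; they contribute nothing to the row, while the number of columns stays equal to the size of the fitted vocabulary.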
