Goal: Predict labels on my original data
Background: I constructed an SVM classifier
I am using the following code:
0) Import modules
import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
1) `X_list` and `y`
type(X_list) #list, strings
len(X_list) #2163
type(y) #numpy.ndarray
len(y) #2163
2) Convert `X_list` from strings to floats using tf-idf
tfidf = TfidfVectorizer()
X_vec = tfidf.fit_transform(X_list)
X = X_vec.toarray()
3) `X` shape
X.shape  # (2163, 8753)
4) 10-fold cross-validation and SVM
skf = StratifiedKFold(n_splits=10)
clf = svm.SVC(kernel='linear', C=1)
5) loop through 10 folds
precision_scores = []
recall_scores = []
f_scores = []
for train_index, test_index in skf.split(X, y):
    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # presumably how scores is computed: returns (precision, recall, F-score, support)
    scores = precision_recall_fscore_support(y_test, y_pred)
    precision_scores.append(scores[0])
    recall_scores.append(scores[1])
    f_scores.append(scores[2])
6) Predict on original dataset X_original
type(X_original) #list, strings
len(X_original) #2163
7) Convert `X_original` from strings to floats
tfidf = TfidfVectorizer()
X_original_transform = tfidf.transform(X_original)
But when I do so, I get the following error:
`NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.`
SO has a similar question, "NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted", but it seems different from my issue.
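For reference, here is a minimal sketch that seems to reproduce the same error on toy data. The three documents and the names `docs`, `fitted`, and `fresh` are made up for illustration and are not part of my real dataset. A newly constructed `TfidfVectorizer` raises the error on `transform`, while the instance that was already fitted transforms the text without complaint.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for X_list / X_original (hypothetical data)
docs = ["the cat sat", "the dog barked", "a bird sang"]

# As in step 2: fit and transform with one vectorizer instance
fitted = TfidfVectorizer()
X_vec = fitted.fit_transform(docs)

# As in step 7: a brand-new instance has no vocabulary yet
fresh = TfidfVectorizer()
try:
    fresh.transform(docs)
    raised = False
except NotFittedError:
    raised = True

# The instance fitted earlier can transform text with its learned vocabulary
X_again = fitted.transform(docs)

print(raised)
print(X_vec.shape == X_again.shape)
```

So the error appears to depend on which vectorizer instance calls `transform`, not on the data itself.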
8) How do I fix this error?