python scikit - ValueError

Question

I am following a Stack Overflow post on how to save a classifier. When I try the approach described in the second post, I keep getting:

ValueError: Vocabulary wasn't fitted or is empty!

My training code is as follows:

train = load_files(learning_data_train)
count_vect = CountVectorizer(tokenizer=tokenize,stop_words='english')
X_train_counts = count_vect.fit_transform(train.data)
clf = SGDClassifier(loss='hinge', penalty='l1',alpha=1e-3, n_iter=5).fit(X_train_counts, train.target)
filename = "SGD.pk1"
joblib.dump(clf, filename)

And my testing code is as follows:

count_vect = CountVectorizer(tokenizer=tokenize,stop_words='english')
filename = "SGD.pk1"
clf = joblib.load(filename)
print(clf)
file= "testfolder/"
docs_new = []
for i in os.listdir(file):
    docs_new.append(open(file+i,"r").read())
X_new_counts = count_vect.transform(docs_new)
predicted = clf.predict(X_new_counts)
for doc, category in zip(docs_new, predicted):
    print(' => %s' % ( train.target_names[category]))

The error is thrown when executing

X_new_counts = count_vect.transform(docs_new)

Is there something I am doing wrong here?

score 0 · Answer 1 · answered Dec 11 '14 at 11:04

You are calling transform on a new CountVectorizer that has never been fitted, so it has no vocabulary. Try fit_transform instead:

X_new_counts = count_vect.fit_transform(docs_new)

See the documentation:

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer.fit_transform
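One caveat: calling fit_transform on the new documents learns a fresh vocabulary from them, so the resulting columns will generally not line up with the features the saved classifier was trained on. A common alternative is to save the fitted vectorizer along with the classifier and call only transform at prediction time. Here is a minimal sketch of that idea. The toy documents, labels, and file names are made up for illustration, and it assumes a recent scikit-learn where joblib is imported as its own package (older versions used sklearn.externals.joblib).

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Toy training data (hypothetical, only for illustration).
train_docs = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stocks fell sharply on monday",
    "the market rallied after earnings",
]
train_labels = [0, 0, 1, 1]

# Training side: fit the vectorizer and the classifier, then save both.
count_vect = CountVectorizer(stop_words="english")
X_train_counts = count_vect.fit_transform(train_docs)
clf = SGDClassifier(loss="hinge", random_state=0).fit(X_train_counts, train_labels)

model_dir = tempfile.mkdtemp()
vect_path = os.path.join(model_dir, "count_vect.pkl")  # hypothetical file name
clf_path = os.path.join(model_dir, "SGD.pkl")          # hypothetical file name
joblib.dump(count_vect, vect_path)
joblib.dump(clf, clf_path)

# Testing side: load both objects and call transform, not fit_transform,
# so the new documents are mapped onto the vocabulary learned in training.
loaded_vect = joblib.load(vect_path)
loaded_clf = joblib.load(clf_path)

docs_new = ["the cat chased the dog", "earnings lifted the market"]
X_new_counts = loaded_vect.transform(docs_new)
predicted = loaded_clf.predict(X_new_counts)
```

If the vectorizer uses a custom tokenizer function, as in the question, that function most likely needs to be importable in the script that loads the pickle as well.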
