I am trying to classify text data with scikit-learn, following the method shown in the tutorial at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html, except that I am loading my own dataset.
I'm getting results, but I want to find the accuracy of the classification.
import numpy as np
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
# Load my own training documents (one sub-folder per category)
text_data = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/train", description=None, categories=None, load_content=True, shuffle=True, encoding='latin-1', decode_error='ignore', random_state=0)
# Bag-of-words counts -> tf-idf weights -> linear SVM
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC(loss='hinge', penalty='l2',
                                       random_state=42)),
                     ])
_ = text_clf.fit(text_data.data, text_data.target)
docs_new = ["Some test sentence here.",]
predicted = text_clf.predict(docs_new)
print(np.mean(predicted == text_data.target))
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, text_data.target_names[category]))
Here, the np.mean line prints 0.566.
If I try:
twenty_test = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/testing", description=None, categories=None, load_content=True, shuffle=True, encoding='latin-1', decode_error='ignore', random_state=0)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
print(np.mean(predicted == twenty_test.target))
Now it prints 1.
I don't understand how this works, what exactly np.mean is computing here, or why it shows different results when the classifier was trained on the same data in both cases.
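To make the question concrete, here is a minimal sketch of what I think the np.mean(predicted == target) expression does, using made-up label arrays. The names y_true, y_pred and single_pred are hypothetical and are not taken from my data. As far as I can tell, the comparison gives a boolean array and np.mean then returns the fraction of True entries. If that is right, comparing a single prediction against all of the labels would be broadcast rather than raise an error:

```python
import numpy as np

# Made-up labels for illustration only (not from my dataset).
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Element-wise comparison gives a boolean array; in the mean,
# True counts as 1 and False as 0, so this is the fraction of matches.
matches = y_pred == y_true
accuracy = np.mean(matches)

# One prediction compared against five labels is broadcast, so the mean
# is the share of labels that happen to equal that single prediction.
single_pred = np.array([1])
share_of_ones = np.mean(single_pred == y_true)

print(accuracy, share_of_ones)
```

With these made-up arrays, the first value is the share of positions where the two arrays agree, and the second only reflects how many labels equal 1, which is not an accuracy at all.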
The "train" folder has approximately 15 documents, and the "testing" folder also has approximately 15 documents, in case that matters. I'm very new to scikit-learn and machine learning in general, so any help is greatly appreciated. Thanks!