
I am trying to classify text data with scikit-learn, following the method shown here (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html), except that I am loading my own dataset.

I'm getting results, but I want to find the accuracy of the classification results.

    from sklearn.datasets import load_files

    text_data = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/train", description=None, categories=None, load_content=True, shuffle=True, encoding='latin-1', decode_error='ignore', random_state=0)

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.svm import LinearSVC
    import numpy as np
    text_clf = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer()),
                        ('clf', LinearSVC(loss='hinge', penalty='l2',
                                                random_state=42)),
    ])

    _ = text_clf.fit(text_data.data, text_data.target)

    docs_new = ["Some test sentence here.",]

    predicted = text_clf.predict(docs_new)
    print np.mean(predicted == text_data.target) 

    for doc, category in zip(docs_new, predicted):
        print('%r => %s' % (doc, text_data.target_names[predicted]))

Here, I get the np.mean prediction as 0.566.

If I try:

    twenty_test = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/testing", description=None, categories=None, load_content=True, shuffle=True, encoding='latin-1', decode_error='ignore', random_state=0)
    docs_test = twenty_test.data
    predicted = text_clf.predict(docs_test)
    np.mean(predicted == twenty_test.target)

Now it prints out 1.

I don't understand how this works, what exactly np.mean does, or why it shows different results when the model was trained on the same data.

The "train" folder has approximately 15 documents, and the "testing" folder also has approximately 15 documents, in case that matters. I'm very new to scikit-learn and machine learning in general, so any help is greatly appreciated. Thanks!

pithukuli

2 Answers


predict() returns an array of the predicted class labels for the given unknown text. See the source here.

docs_new = ['God is love', 'OpenGL on the GPU is fast', 'java', '3D', 'Cinema 4D']
predicted = clf.predict(X_new_tfidf)  # X_new_tfidf: the count/tf-idf features of docs_new, as built in the tutorial
print predicted
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

[3 1 2 1 1]
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
'java' => sci.med
'3D' => comp.graphics
'Cinema 4D' => comp.graphics

As you can see, predict() returns an array. The numbers in the array are indices of the labels, which are looked up in the subsequent for loop.
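For illustration, a minimal sketch of that index lookup (assuming the four tutorial categories were loaded, which end up in alphabetical order):

print(twenty_train.target_names)     # ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
print(twenty_train.target_names[3])  # soc.religion.christian
print(twenty_train.target_names[1])  # comp.graphics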

When you call np.mean, the intent is to determine the accuracy of the classifier; it is not applicable in your first example since the text "Some test sentence here." has no label. This piece of text can, however, be used to predict which label it belongs to. You can achieve that in your script by changing:

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, text_data.target_names[predicted]))

to:

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, text_data.target_names[category]))

Your second call to np.mean, on the other hand, returns 1, which means the classifier was able to assign the unseen documents to their correct labels with 100% accuracy. This works because the twenty_test data also has label information.
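As a side note, the np.mean idiom computes the same number as sklearn.metrics.accuracy_score, i.e. the fraction of documents assigned their correct label (a minimal sketch, reusing predicted and twenty_test from above):

from sklearn import metrics
print(metrics.accuracy_score(twenty_test.target, predicted))  # equals np.mean(predicted == twenty_test.target)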

To obtain further information on the accuracy of your classifier you can:

from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names)) 


                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502

and if you want a confusion matrix you can:

metrics.confusion_matrix(twenty_test.target, predicted)

array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]])
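Each row of the confusion matrix corresponds to a true category and each column to a predicted category, with categories in the order of twenty_test.target_names. A small sketch to print it with its labels (reusing the variables above):

cm = metrics.confusion_matrix(twenty_test.target, predicted)
for name, row in zip(twenty_test.target_names, cm):
    print('%s: %s' % (name, row))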
Harpal
text_data = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/train", ...)

According to the documentation, that line loads your files' contents from C:/Users/USERNAME/projects/machine_learning/my_project/train into text_data.data. It will also load the target label (represented by its integer index) for each document into text_data.target. So text_data.data should be a list of strings and text_data.target a list of integers. The labels are derived from the names of the subfolders the files sit in. Your description sounds like you don't have any subfolders in C:/.../train/ and C:/.../test/, which will probably create problems (e.g. all labels being identical).
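load_files expects a layout where every category is a subfolder and every document is a file inside it, roughly like this (the folder and file names below are just placeholders):

    train/
        sports_news/
            doc_001.txt
            doc_002.txt
        cultural_news/
            doc_003.txt
            doc_004.txt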

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LinearSVC(loss='hinge', penalty='l2',
                                            random_state=42)),
])

_ = text_clf.fit(text_data.data, text_data.target)

The above lines train (in .fit()) a classifier on your example documents. Very roughly speaking, you are telling the classifier (LinearSVC) how often which words appear in which documents (CountVectorizer, TfidfTransformer) and which label each of those documents has (text_data.target). Your classifier then tries to learn a rule that essentially maps those word frequencies (TF-IDF values) to labels (e.g. dog and cat strongly indicating the label animal).
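To make that concrete, here is a minimal sketch of what the first two pipeline steps produce on a made-up corpus:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the dog barks", "the cat meows", "dogs and cats are animals"]
counts = CountVectorizer().fit_transform(docs)    # sparse matrix of raw word counts, one row per document
tfidf = TfidfTransformer().fit_transform(counts)  # same shape, but counts reweighted by tf-idf
print(counts.shape)                               # (3, size of the vocabulary)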

docs_new = ["Some test sentence here.",]
predicted = text_clf.predict(docs_new)

After training your classifier on example data, you provide one completely new document and let your classifier predict the most appropriate label for that document based on what it has learned. predicted should then be a list of (indices of) labels with just one element (because you supplied one document), e.g. [5].

print np.mean(predicted == text_data.target)

Here you are comparing the list of predictions (1 element) to the list of labels from your training data (15 elements) and then taking the mean of the result. That doesn't make much sense, because of the different list sizes and because your new example document doesn't have anything to do with the training labels. NumPy will broadcast your single predicted label (e.g. 5) against every element in text_data.target. That produces a list like [False, False, False, True, False, True, ...], which np.mean interprets as [0, 0, 0, 1, 0, 1, ...], so the result is 1/15 * (0+0+0+1+0+1+...).
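A small sketch of that broadcasting behaviour (the label values are made up):

import numpy as np

predicted = np.array([5])              # one predicted label
targets = np.array([5, 2, 0, 5, 1])    # several training labels
print(predicted == targets)            # [ True False False  True False]
print(np.mean(predicted == targets))   # 0.4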

What you should be doing instead is something like:

docs_new = ["Some test sentence here."]
docs_new_labels = [1] # correct label index of the document

predicted = text_clf.predict(docs_new)
print np.mean(predicted == docs_new_labels) 

At the very least you shouldn't compare to your training labels. Notice that if np.mean returns 1, then all documents were correctly classified. In the case of your test dataset this seems to happen. Make sure that your test and training data files are actually different, as 100% accuracy isn't very common (it might however be an artifact of your low number of training files). On a side note, notice that you are currently not using tokenization, so for your classifier here and here. will be completely different words.
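If you want tokenization, one option is to plug a tokenizer into CountVectorizer; a rough sketch, assuming nltk and its punkt data are installed (the rest of the pipeline stays unchanged):

from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# word_tokenize splits off punctuation, so "here." becomes the two tokens "here" and "."
vect = CountVectorizer(tokenizer=word_tokenize)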

aleju
  • Thanks! That was very useful and clarified many things for me! (1) I do have folders inside of `C:/.../train/` and `C:/.../testing/`. Both of them have the same folder outline, e.g. "sports news" and "cultural news", in the same order as shown [here](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html). (2) `np.mean(predicted == docs_new_labels)`, like you said, returns 1.0. Does this mean that np.mean is the confidence the algo has in the level of accuracy of the prediction? (3) Could you expand on your sidenote about tokenization more? Thanks again! – pithukuli May 20 '15 at 04:00
  • And sorry if this is a totally different question, but what's the best way to check if the algorithm is actually performing well? – pithukuli May 20 '15 at 04:05
  • Using `np.mean` just tells you how often your classifier is correct on your testset (e.g. 0.5 = every second time). Tokenization transforms your content into more tokens. A typical rule would be to split off punctuation, e.g. `here.` (one rare word) would become `here .` (two common words). You can use nltk for that, see [here](http://stackoverflow.com/questions/15057945/how-do-i-tokenize-a-string-sentence-in-nltk). One good way to test your algorithm is to gather a decent amount of unique test content and then use `metrics.classification_report` as in Harpal's answer. – aleju May 20 '15 at 10:19