I work in customer support, and I'm using scikit-learn to predict tags for our tickets, given a training set of approximately 40,000 tickets.

I'm using a classification model based on this one. It predicts an empty tuple, "()", as the tags for many tickets in my test set, even though every ticket in the training set has at least one tag.

My training data for tags is a list of lists, like:

tags_train = [['international_solved'], ['from_build_guidelines my_new_idea eligibility'], ['dropbox other submitted_faq submitted_help'], ['my_new_idea_solved'], ['decline macro_backer_paypal macro_prob_errored_pledge_check_credit_card_us loading_problems'], ['dropbox macro__turnaround_time other plq__turnaround_time submitted_help'], ['dropbox macro_creator__logo_style_guide outreach press submitted_help']]

While my training data for ticket descriptions is just a list of strings, e.g.:

descs_train = ['description of ticket one', 'description of ticket two', etc]

Here's the relevant part of my code to build the model:

import numpy as np
import scipy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# We have lists called tags_train, descs_train, tags_test, descs_test with the test and train data

X_train = np.array(descs_train)
y_train = tags_train
X_test = np.array(descs_test)  

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight='auto')))])

classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)

However, "predicted" gives a list that looks like:

predicted = [(), ('account_solved',), (), ('images_videos_solved',), ('my_new_idea_solved',), (), (), (), (), (), ('images_videos_solved', 'account_solved', 'macro_launched__edit_update other tips'), ('from_guidelines my_new_idea', 'from_guidelines my_new_idea macro__eligibility'), ()]

I don't understand why it's predicting blank () when there are none in the training set. Shouldn't it predict the closest tag? Can anyone recommend any improvements to the model I'm using?

Thank you so much for your help in advance!

jegeragh
  • [CountVectorizer documentation](http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) [TfidfTransformer documentation](http://scikit-learn.github.io/scikit-learn.org/0.8/modules/generated/scikits.learn.feature_extraction.text.TfidfTransformer.html) [OneVsRestClassifier documentation](http://scikit-learn.org/dev/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) – jegeragh Jun 04 '13 at 23:37
  • Do you want multi-class or multi-label classification? Is a ticket allowed to be tagged with more than one tag? – mbatchkarov Jun 07 '13 at 08:40

2 Answers

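The problem is with your tags_train variable. According to the OneVsRestClassifier documentation, the targets need to be "a sequence of sequences of labels". Your targets are one-element lists, each holding a single string with all of a ticket's tags in it, so the classifier treats each whole string as one label.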
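To make the mismatch concrete, here is a minimal sketch that converts targets in the question's format into one tuple of separate labels per ticket. It assumes the tags inside each string are separated by whitespace, which the question does not state explicitly, and `split_tags` is a hypothetical helper name, not part of scikit-learn.

```python
def split_tags(tags):
    """Flatten each row of tag strings into a tuple of individual labels.

    Assumption: tags inside a string are separated by whitespace.
    """
    return [
        tuple(label for entry in row for label in entry.split())
        for row in tags
    ]


# Two rows copied from the question's tags_train
tags_train = [
    ['international_solved'],
    ['from_build_guidelines my_new_idea eligibility'],
]

y_train = split_tags(tags_train)
print(y_train)
# [('international_solved',), ('from_build_guidelines', 'my_new_idea', 'eligibility')]
```

A ticket with one tag still ends up as a one-element tuple, so every row has the same "sequence of labels" shape.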

Below is an edited, self-contained and working version of your code. Note the change in tags_train: each entry is now a tuple of separate labels, and a ticket with a single tag gets a one-element tuple such as ('label',).

import numpy as np
import scipy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC


# We have lists called tags_train, descs_train, tags_test, descs_test with the test and train data
tags_train = [('label', ), ('international' ,'solved'), ('international','open')]
descs_train = ['description of ticket one', 'some other ticket two', 'label']

X_train = np.array(descs_train)
y_train = tags_train
X_test = np.array(descs_train)  

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight='auto')))])

classifier = classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)

print(predicted)

The output is

[('international',), ('international',), ('international', 'open')]
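The code above relies on the scikit-learn API from around the time of the question. Later releases no longer accept a sequence of sequences as the multi-label target, and `class_weight='auto'` was replaced by `class_weight='balanced'`. The following is a sketch of the same pipeline, assuming a recent scikit-learn release, with `MultiLabelBinarizer` converting the tuples to a binary indicator matrix and back. It reuses the toy data above; the predictions are not shown because they may differ from the output listed earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

tags_train = [('label',), ('international', 'solved'), ('international', 'open')]
descs_train = ['description of ticket one', 'some other ticket two', 'label']

# One column per distinct label, one row per ticket (1 = ticket has the label)
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(tags_train)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight='balanced'))),
])
classifier.fit(descs_train, Y_train)

# predict() returns an indicator matrix; map it back to tuples of label names
predicted = mlb.inverse_transform(classifier.predict(descs_train))
print(predicted)
```

Keeping `mlb` around matters: the same fitted binarizer has to be used to decode predictions, otherwise the column order of the labels is lost.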
mbatchkarov
I'm still getting the () prediction, even after converting each target from a one-element list into a sequence of labels.
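One possible reason, offered as a guess rather than a diagnosis: in the multi-label setting OneVsRestClassifier trains one binary classifier per label and each decides independently, so a ticket can be rejected by all of them and come back as (). Below is a sketch of a workaround that forces the top-scoring label whenever no label fires. It assumes a recent scikit-learn release and at least two distinct labels; `predict_at_least_one` is a hypothetical helper, not a scikit-learn function, and forcing a label is a design choice that may not suit every use case.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC


def predict_at_least_one(model, X):
    """Predict an indicator matrix, forcing the best-scoring label on empty rows."""
    Y_pred = np.array(model.predict(X))      # copy, shape (n_samples, n_labels)
    scores = model.decision_function(X)      # per-label margins, same shape
    rows = np.flatnonzero(Y_pred.sum(axis=1) == 0)   # samples with no label
    Y_pred[rows, scores[rows].argmax(axis=1)] = 1
    return Y_pred


# Toy data, for illustration only
tags_train = [('label',), ('international', 'solved'), ('international', 'open')]
descs_train = ['description of ticket one', 'some other ticket two', 'label']

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(tags_train)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight='balanced'))),
])
classifier.fit(descs_train, Y_train)

X_new = ['some other ticket', 'completely unseen words']
Y_new = predict_at_least_one(classifier, X_new)
print(mlb.inverse_transform(Y_new))
```

Every row of `Y_new` contains at least one label, at the cost of sometimes returning a low-confidence tag.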


Mayur Karmur