I am using scikit-learn to classify news articles and I am wondering which classifier I should use. I have a training set with labelled data, which makes this a supervised learning problem, and an article can belong to multiple categories (say, finance and politics), which makes this a multi-label scenario.

I am currently using CountVectorizer for preprocessing, then LinearSVC wrapped in MultiOutputClassifier to build the model. I chose LinearSVC by following the flow chart here: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html.

classifier = MultiOutputClassifier(LinearSVC())
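For context, the whole setup described above can be sketched roughly as follows. The articles and labels are made-up toy data, and the use of `make_pipeline` and `MultiLabelBinarizer` is an assumption for illustration, not necessarily what the real code does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only.
articles = [
    "stocks rally as central bank cuts interest rates",
    "parliament votes on new election law",
    "government budget raises taxes on banks",
    "senate debates foreign policy",
    "bank profits rise on stock market gains",
    "finance minister defends budget in parliament",
]
labels = [
    ["finance"],
    ["politics"],
    ["finance", "politics"],
    ["politics"],
    ["finance"],
    ["finance", "politics"],
]

# Turn the label lists into a binary indicator matrix, one column per category.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

# Bag-of-words counts, then one independent LinearSVC per label column.
model = make_pipeline(CountVectorizer(), MultiOutputClassifier(LinearSVC()))
model.fit(articles, Y)

# Predict an indicator row and map it back to category names.
predicted = model.predict(["bank shares fall after election"])
predicted_labels = binarizer.inverse_transform(predicted)
print(predicted_labels)
```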

But I am not sure if there is a better algorithm for my use case. Any comments on my approach?

Steve
  • Possible duplicate of [use scikit-learn to classify into multiple categories](http://stackoverflow.com/questions/10526579/use-scikit-learn-to-classify-into-multiple-categories) – mohammad May 12 '17 at 09:51
  • Try `tf-idf` and random forest. – polkovnikov.ph May 12 '17 at 09:51
  • @mohammad I am aware of that question, but that question can't even get the thing working properly. In my case, I already got my multi-label but I was just wondering what is the better classifier in my use case. In your tagged question there is no debate at all regarding which classifier to use which is what I am looking for. – Steve May 12 '17 at 10:01
  • @polkovnikov.ph tf-idf is just a transformer, correct? BTW If I were to use Random Forest do I still need to pass LinearSVC as meta-estimator? – Steve May 12 '17 at 10:02

1 Answer

Try `SGDClassifier` from scikit-learn. It gives you more options for model building (for example, different loss functions and penalties), and it is usually faster than `LinearSVC` on large datasets.

You should also use `OneVsRestClassifier` instead of `MultiOutputClassifier`, since you are looking for multi-label output.
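A minimal sketch of how the two suggestions could fit together, not a tuned solution. The toy articles and labels are invented, and `TfidfVectorizer` (suggested in the comments), `loss="hinge"`, and `random_state=0` are my assumptions; swap in your own vectorizer and settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data, invented for illustration only.
articles = [
    "stocks rally as central bank cuts interest rates",
    "parliament votes on new election law",
    "government budget raises taxes on banks",
    "senate debates foreign policy",
    "bank profits rise on stock market gains",
    "finance minister defends budget in parliament",
]
labels = [
    ["finance"],
    ["politics"],
    ["finance", "politics"],
    ["politics"],
    ["finance"],
    ["finance", "politics"],
]

# Binary indicator matrix: one column per category.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

# loss="hinge" makes SGDClassifier fit a linear SVM with stochastic gradient
# descent; OneVsRestClassifier trains one such binary model per category.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(SGDClassifier(loss="hinge", random_state=0)),
)
model.fit(articles, Y)

# Predict an indicator row and map it back to category names.
predicted = model.predict(["bank shares fall after election"])
predicted_labels = binarizer.inverse_transform(predicted)
print(predicted_labels)
```

Other values of `loss` (for example `"log_loss"` in recent scikit-learn versions) turn the same pipeline into a different linear model, which is what I mean by more options.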

Venkatachalam