
I'm new to scikit-learn and am working with some data like the following:

data[0] = {"string": "some arbitrary text", "label1": "orange", "value1" : False }
data[0] = {"string": "some other arbitrary text", "label1": "red", "value1" : True }

For single lines of text there is CountVectorizer, and DictVectorizer can sit in the pipeline before TfidfTransformer. I'm hoping the outputs of these could be concatenated, with one caveat: I don't want the arbitrary text to be equal in importance to the specific, limited, and well-defined parameters.

Finally, some other possibly related questions:

  • might this data structure indicate which SVM kernel is best?
  • Or would a Random Forest/Decision Tree, DBN, or Bayes classifier possibly do better in this case? Or an Ensemble method? (The output is multi-class)
  • I see there is an upcoming feature for feature union, but this is to run different methods over the same data and combine them.
  • Should I be using feature selection?

Jonathan Hendler

1 Answer

All classifiers in scikit-learn(*) expect a flat feature representation for samples, so you'll probably want to turn your string feature into a vector. First, let's get some incorrect assumptions out of the way:

  • DictVectorizer is not for handling "lines of text", but for arbitrary symbolic features.
  • CountVectorizer is also not for handling lines, but for entire text documents.
  • Whether features are "equal in importance" is mostly up to the learning algorithm, though with a kernelized SVM, you can assign artificially small weights to features to make its dot products come out differently. I'm not saying that's a good idea, though.

There are two ways of handling this kind of data:

  1. Build a FeatureUnion consisting of a CountVectorizer (or TfidfVectorizer) for your textual data and a DictVectorizer for the additional features (see the sketch after this list).
  2. Manually split the textual data into words, then use each word as a feature in a DictVectorizer, e.g.

    {"string:some": True, "string:arbitrary": True, "string:text": True,
     "label1": "orange", "value1" : False }
    

Then the related questions:

  • might this data structure indicate which SVM kernel is best?

Since you're handling textual data, try a LinearSVC first, and a polynomial kernel of degree 2 if that doesn't work. RBF kernels are a bad match for textual data, and cubic or higher-order polynomial kernels tend to overfit badly. As an alternative to kernels, you can manually construct products of individual features and train a LinearSVC on that; sometimes that works better than a kernel. It also sidesteps the feature-importance issue, since a LinearSVC learns per-feature weights.
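
As an illustration of that last point, here's a hedged sketch of generating degree-2 feature products explicitly and feeding them to a LinearSVC. PolynomialFeatures is a later scikit-learn addition than this answer, and X, y are assumed to come from the vectorization step above:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import LinearSVC

    # interaction_only=True adds pairwise products x_i * x_j without squared
    # terms, which mimics a degree-2 polynomial kernel explicitly.
    clf = make_pipeline(
        PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
        LinearSVC())
    # clf.fit(X, y)  # older scikit-learn versions may need X.toarray() here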

  • Or would a Random Forest/Decision Tree, DBN, or Bayes classifier possibly do better in this case?

That's impossible to tell without trying. scikit-learn's random forests and dtrees unfortunately don't handle sparse matrices, so they're rather hard to apply. DBNs are not implemented.
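
If you still want to try a random forest on this data, a common workaround is to densify the feature matrix first, assuming it is small enough to fit in memory (note that newer scikit-learn releases do accept sparse input for tree-based models):

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=100)
    # rf.fit(X.toarray(), y)  # densifying can blow up memory for large vocabularies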

  • Should I be using feature selection?

Impossible to tell without seeing the data.
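
If you do experiment with it, univariate chi-squared selection is a common starting point for sparse count/tf-idf features. A minimal sketch, where k=1000 is an arbitrary placeholder:

    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # chi2 requires non-negative feature values, which holds for counts/tf-idf.
    clf = make_pipeline(SelectKBest(chi2, k=1000), LinearSVC())
    # clf.fit(X, y)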

(*) Except SVMs if you implement custom kernels, which is such an advanced topic that I won't discuss it now.

Fred Foo

  • Thanks for answering a mix of specific and general questions thoroughly. LinearSVC can have class weights [1]. Why do you recommend against? Or is it that you aren't recommending because it's a data specific question? [1] - http://scikit-learn.sourceforge.net/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC – Jonathan Hendler Apr 01 '13 at 19:59
  • I'm not recommending against `LinearSVC`, I'm all for it for handling textual data. And class weights have nothing to do with this. – Fred Foo Apr 01 '13 at 20:57
  • Sorry, I meant I was trying to clarify that you were against using weights. " ... though with a kernelized SVM, you can assign artificially small weights to features to make its dot products come out differently. I'm not saying that's a good idea, though." Thanks again. – Jonathan Hendler Apr 01 '13 at 21:41
  • *Feature* weights, not class weights. Clarifying this comment: I personally have better experience with generating products of features manually (followed by feature selection) than using polynomial kernels and `SVC`. – Fred Foo Apr 02 '13 at 09:51