
I'm new to scikit-learn and am working with some data like the following:

data[0] = {"string": "some arbitrary text", "label1": "orange", "value1" : False }
data[0] = {"string": "some other arbitrary text", "label1": "red", "value1" : True }

For single lines of text there is CountVectorizer, and DictVectorizer can sit in the pipeline before TfidfTransformer. I'm hoping the outputs of these could be concatenated, with one caveat: I don't want the arbitrary text to be equal in importance to the specific, limited, and well-defined parameters.

Finally, some other possibly related questions:

  • might this data structure indicate which SVM kernel is best?
  • Or would a Random Forest/Decision Tree, DBN, or Bayes classifier possibly do better in this case? Or an Ensemble method? (The output is multi-class)
  • I see there is an upcoming feature for feature union, but this is to run different methods over the same data and combine them.
  • Should I be using feature selection?

Jonathan Hendler

1 Answer

All classifiers in scikit-learn(*) expect a flat feature representation for samples, so you'll probably want to turn your string feature into a vector. First, let's get some incorrect assumptions out of the way:

  • DictVectorizer is not for handling "lines of text", but for arbitrary symbolic features.
  • CountVectorizer is also not for handling lines, but for entire text documents.
  • Whether features are "equal in importance" is mostly up to the learning algorithm, though with a kernelized SVM, you can assign artificially small weights to features to make its dot products come out differently. I'm not saying that's a good idea, though.

There are two ways of handling this kind of data:

  1. Build a FeatureUnion consisting of a CountVectorizer (or TfidfVectorizer) for your textual data and a DictVectorizer for the additional features (see the sketch after this list).
  2. Manually split the textual data into words, then use each word as a feature in a DictVectorizer, e.g.

    {"string:some": True, "string:arbitrary": True, "string:text": True,
     "label1": "orange", "value1" : False }
    

Then the related questions:

  • might this data structure indicate which SVM kernel is best?

Since you're handling textual data, try a LinearSVC first, and a polynomial kernel of degree 2 if that doesn't work. RBF kernels are a bad match for textual data, and cubic or higher-order polynomial kernels tend to overfit badly. As an alternative to kernels, you can manually construct products of individual features and train a LinearSVC on that; sometimes that works better than a kernel. It also sidesteps the feature-importance issue, since a LinearSVC learns per-feature weights.
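
As an illustration of that last point, here's a hedged sketch of generating degree-2 feature products explicitly and feeding them to a LinearSVC. PolynomialFeatures is a later scikit-learn addition than this answer, and X, y are assumed to come from the vectorization step above:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import LinearSVC

    # interaction_only=True adds pairwise products x_i * x_j without squared
    # terms, which mimics a degree-2 polynomial kernel explicitly.
    clf = make_pipeline(
        PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
        LinearSVC())
    # clf.fit(X, y)  # older scikit-learn versions may need X.toarray() here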

  • Or would a Random Forest/Decision Tree, DBN, or Bayes classifier possibly do better in this case?

That's impossible to tell without trying. scikit-learn's random forests and dtrees unfortunately don't handle sparse matrices, so they're rather hard to apply. DBNs are not implemented.
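
If you still want to try a random forest on this data, a common workaround is to densify the feature matrix first, assuming it is small enough to fit in memory (note that newer scikit-learn releases do accept sparse input for tree-based models):

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(n_estimators=100)
    # rf.fit(X.toarray(), y)  # densifying can blow up memory for large vocabularies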

  • Should I be using feature selection?

Impossible to tell without seeing the data.
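
If you do experiment with it, univariate chi-squared selection is a common starting point for sparse count/tf-idf features. A minimal sketch, where k=1000 is an arbitrary placeholder:

    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # chi2 requires non-negative feature values, which holds for counts/tf-idf.
    clf = make_pipeline(SelectKBest(chi2, k=1000), LinearSVC())
    # clf.fit(X, y)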

(*) Except SVMs if you implement custom kernels, which is such an advanced topic that I won't discuss it now.

Fred Foo

  • Thanks for answering a mix of specific and general questions thoroughly. LinearSVC can have class weights [1]. Why do you recommend against? Or is it that you aren't recommending because it's a data specific question? [1] - http://scikit-learn.sourceforge.net/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC – Jonathan Hendler Apr 01 '13 at 19:59
  • I'm not recommending against `LinearSVC`, I'm all for it for handling textual data. And class weights have nothing to do with this. – Fred Foo Apr 01 '13 at 20:57
  • Sorry, I meant I was trying to clarify that you were against using weights. " ... though with a kernelized SVM, you can assign artificially small weights to features to make its dot products come out differently. I'm not saying that's a good idea, though." Thanks again. – Jonathan Hendler Apr 01 '13 at 21:41
  • *Feature* weights, not class weights. Clarifying this comment: I personally have better experience with generating products of features manually (followed by feature selection) than using polynomial kernels and `SVC`. – Fred Foo Apr 02 '13 at 09:51