All classifiers in scikit-learn(*) expect a flat feature representation for samples, so you'll probably want to turn your string feature into a vector. First, let's get some incorrect assumptions out of the way:
- `DictVectorizer` is not for handling "lines of text", but for arbitrary symbolic features.
- `CountVectorizer` is also not for handling lines, but for entire text documents.
- Whether features are "equal in importance" is mostly up to the learning algorithm, though with a kernelized SVM you can assign artificially small weights to features to make its dot products come out differently (see the toy sketch below). I'm not saying that's a good idea, though.
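For concreteness, that weighting trick is nothing more than rescaling columns before fitting, since the SVM itself has no per-feature importance knob. A toy sketch with made-up data:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, 9.0]])
y = [0, 1]

# Shrink the last column so it contributes less to the kernel's dot products.
X[:, 2] *= 0.01

clf = SVC(kernel="poly", degree=2).fit(X, y)
```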
There are two ways of handling this kind of data:
- Build a `FeatureUnion` consisting of a `CountVectorizer` (or `TfidfVectorizer`) for your textual data and a `DictVectorizer` for the additional features; see the sketch after this list.
- Manually split the textual data into words, then use each word as a feature in a `DictVectorizer`, e.g.

```python
{"string:some": True, "string:arbitrary": True, "string:text": True,
 "label1": "orange", "value1": False}
```
Then the related questions:
- might this data structure indicate which SVM kernel is best?

Since you're handling textual data, try a `LinearSVC` first, and a polynomial kernel of degree 2 if that doesn't work. RBF kernels are a bad match for textual data, and cubic or higher-order polynomial kernels tend to overfit badly. As an alternative to kernels, you can manually construct products of individual features and train a `LinearSVC` on that (see the sketch below); sometimes, that works better than a kernel. It also gets rid of the feature-importance issue, since a `LinearSVC` learns per-feature weights.
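A sketch of that products-of-features alternative, assuming the feature matrix is small enough to expand explicitly (`PolynomialFeatures` is one stock way to build the degree-2 products rather than writing the loop yourself):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

# Explicit degree-2 feature products followed by a linear SVM: roughly what
# a quadratic kernel computes implicitly, but with an inspectable weight
# learned for every individual product.
clf = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearSVC(),
)
```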
- Or would a Random Forest/Decision Tree, DBN, or Bayes classifier possibly do better in this case?

That's impossible to tell without trying. scikit-learn's random forests and decision trees unfortunately don't handle sparse matrices, so they're rather hard to apply here. DBNs are not implemented.
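If you do want to try the tree-based models anyway, one workaround, assuming a vocabulary small enough for a dense matrix to fit in memory, is to densify the vectorizer output first:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

docs = ["some arbitrary text", "other text entirely"]
y = [0, 1]

# .toarray() densifies the sparse document-term matrix; this blows up in
# memory quickly, so it is only viable for small feature spaces.
X_dense = CountVectorizer().fit_transform(docs).toarray()
rf = RandomForestClassifier(n_estimators=100).fit(X_dense, y)
```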
- Should I be using feature selection?
Impossible to tell without seeing the data.
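If you do experiment with it, a common starting point for count-style text features is a chi-squared filter; the k below is an arbitrary placeholder to tune by cross-validation:

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# chi2 requires non-negative features, so raw counts or tf-idf are fine.
clf = make_pipeline(SelectKBest(chi2, k=1000), LinearSVC())
```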
(*) Except SVMs if you implement custom kernels, which is such an advanced topic that I won't discuss it now.