
What I am doing:

I am solving a classification problem using Random Forests. I have a set of fixed-length strings (10 characters long) that represent DNA sequences. The DNA alphabet consists of 4 letters: A, C, G, T.

Here's a sample of my raw data:

ATGCTACTGA
ACGTACTGAT
AGCTATTGTA
CGTGACTAGT
TGACTATGAT

Each DNA sequence comes with experimental data describing a real biological response: the molecule was either seen to elicit a biological response (1) or not (0).

Problem:

The training set consists of both categorical (nominal) and numerical features, and has the following structure:

training_set = [
  {'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
   'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
   'mass':370.2, 'temp':70.0},
  {'p1':'A', 'p2':'C', 'p3':'G', 'p4':'T', 'p5':'A', 
   'p6':'C', 'p7':'T', 'p8':'G', 'p9':'A', 'p10':'T', 
   'mass':400.3, 'temp':67.2},
]

target = [1, 0]

I successfully create the classifier, using the DictVectorizer class to encode the nominal features, but I run into problems when performing predictions on my test data.

Below is a simplified version of my code so far:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

training_set = [
  {'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
   'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
   'mass':370.2, 'temp':70.0},
  {'p1':'A', 'p2':'C', 'p3':'G', 'p4':'T', 'p5':'A', 
   'p6':'C', 'p7':'T', 'p8':'G', 'p9':'A', 'p10':'T', 
   'mass':400.3, 'temp':67.2},
]

target = [1, 0]

vec = DictVectorizer()
train = vec.fit_transform(training_set).toarray()

clf = RandomForestClassifier(n_estimators=1000)
clf = clf.fit(train, target)


# The following part fails.
test_set = {
  'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
  'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
  'mass':370.2, 'temp':70.0}
vec = DictVectorizer()
test = vec.fit_transform(test_set).toarray()
print(clf.predict_proba(test))

As a result, I get the following error:

ValueError: Number of features of the model must match the input.
Model n_features is 20 and input n_features is 12
  • possible duplicate of [how to force scikit-learn DictVectorizer not to discard features?](http://stackoverflow.com/questions/19770147/how-to-force-scikit-learn-dictvectorizer-not-to-discard-features) – Fred Foo Jan 27 '14 at 10:35

1 Answer


You should use the same DictVectorizer object that was fitted on the training set to transform the test_set. Calling fit_transform on the test data fits a brand-new vectorizer that learns its feature mapping from that single sample alone (10 position features plus mass and temp, i.e. 12 columns), whereas the model was trained on the 20 columns derived from the training set. Transforming the test data with the already-fitted vectorizer keeps the columns aligned:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

training_set = [
  {'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
   'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
   'mass':370.2, 'temp':70.0},
  {'p1':'A', 'p2':'C', 'p3':'G', 'p4':'T', 'p5':'A', 
   'p6':'C', 'p7':'T', 'p8':'G', 'p9':'A', 'p10':'T', 
   'mass':400.3, 'temp':67.2},
]

target = [1, 0]

vec = DictVectorizer()
train = vec.fit_transform(training_set).toarray()

clf = RandomForestClassifier(n_estimators=1000)
clf = clf.fit(train, target)


# Reuse the vectorizer fitted on the training data instead of fitting a new one.
test_set =   {
  'p1':'A', 'p2':'T', 'p3':'G', 'p4':'C', 'p5':'T', 
  'p6':'A', 'p7':'C', 'p8':'T', 'p9':'G', 'p10':'A', 
  'mass':370.2, 'temp':70.0}

test = vec.transform(test_set).toarray()
print(clf.predict_proba(test))
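
To see where the 20-versus-12 mismatch in the error message comes from, you can inspect the columns the fitted vectorizer produced. This is just a diagnostic sketch, assuming get_feature_names() (older scikit-learn releases; newer ones expose get_feature_names_out() instead):

# The vectorizer fitted on the training data exposes its learned columns:
# 'mass', 'temp', plus one indicator column per position/letter pair that
# occurs in the training set (e.g. 'p1=A', 'p2=T'), 20 columns in total.
# Refitting a fresh vectorizer on the single test dict yields only 12.
print(vec.get_feature_names())
print(len(vec.get_feature_names()))   # 20, matching the model's n_features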
  • Thank you, it works great. However, I noticed that dealing with a large number of strings makes the resulting matrix very wide and overloads my memory. I was wondering if you could suggest some other ways of creating the classifier. In the scikit-learn documentation I read about [feature hashing](http://scikit-learn.org/stable/modules/feature_extraction.html), but I can't find a way to use it on my data. – sherlock85 Jan 26 '14 at 17:24
  • @s_sherly To make `FeatureHasher` work, you need to replace the categorical features with dummy variables yourself: `"p1=A": 1` etc. But it might be a better idea to do feature selection and/or dimension reduction with `TruncatedSVD` on the sparse matrix that comes out of the vectorizer. – Fred Foo Jan 27 '14 at 10:39
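
Following up on the two comments above, here is a minimal sketch of both suggestions. The to_hashed_dict helper is a hypothetical name introduced for this example; it rewrites each categorical position as a "p1=A"-style dummy key so that FeatureHasher can project the records into a fixed number of columns, and it passes the numeric features through unchanged. The second option keeps DictVectorizer but reduces the dimensionality of its sparse output with TruncatedSVD (n_components=2 only because the toy data above has two rows; a real data set would use a larger value).

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

def to_hashed_dict(record):
    # Hypothetical helper: encode string values as "p1=A"-style dummy
    # features with value 1, and keep numeric features as they are.
    hashed = {}
    for name, value in record.items():
        if isinstance(value, str):
            hashed['%s=%s' % (name, value)] = 1
        else:
            hashed[name] = value
    return hashed

# Option 1: FeatureHasher projects into a fixed number of columns (256 here),
# so the width of the matrix no longer grows with the number of distinct
# position/letter pairs seen in the data.
hasher = FeatureHasher(n_features=256, input_type='dict')
X_train = hasher.transform(to_hashed_dict(r) for r in training_set)
X_test = hasher.transform([to_hashed_dict(test_set)])

# Option 2: keep DictVectorizer but compress its sparse output with
# TruncatedSVD before handing it to the classifier.
vec = DictVectorizer()
svd = TruncatedSVD(n_components=2)   # toy value; use more components on real data
X_train_reduced = svd.fit_transform(vec.fit_transform(training_set))
X_test_reduced = svd.transform(vec.transform([test_set]))

The hashed matrices are sparse (depending on your scikit-learn version, RandomForestClassifier may need .toarray() before fitting), while the SVD output is already a small dense array.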