
I'm working with a small data set of 5 variables and ~90k observations. I've tried fitting a random forest classifier mimicking the iris example from http://blog.yhathq.com/posts/random-forests-in-python.html. However, my challenge is that my predicted values are all the same: 0. I'm new to Python, but familiar with R. Not sure if this is a coding mistake, or if this means my data is trash.

from sklearn.ensemble import RandomForestClassifier
data = train_df[cols_to_keep]
data = data.join(dummySubTypes.ix[:, 1:])
data = data.join(dummyLicenseTypes.ix[:, 1:])
data['is_train'] = np.random.uniform(0, 1, len(data)) <= .75
#data['type'] = pd.Categorical.from_codes(data['type'],["Type1","Type2"])
data.head()
Mytrain, Mytest = data[data['is_train']==True], data[data['is_train']==False]
Myfeatures = data.columns[1:5]  # feature names: subtype dummy variables
rf = RandomForestClassifier(n_jobs=2)
y, _ = pd.factorize(Mytrain['type'])
rf.fit(Mytrain[Myfeatures], y)
data.target_names = np.asarray(list(set(data['type'])))
preds = data.target_names[rf.predict(Mytest[Myfeatures])]

Predictions of one class, Type1:

In[583]: pd.crosstab(Mytest['type'], preds, rownames=['actual'], colnames=['preds'])
Out[583]: 
preds          Type1
actual                   
Type1          17818
Type2          7247

Update: First few rows of data:

In[670]: Mytrain[Myfeatures].head()
Out[670]: 
subtype_INDUSTRIAL  subtype_INSTITUTIONAL  subtype_MULTIFAMILY  \
0                   0                      0                    0   
1                   0                      0                    0   
2                   0                      0                    0   
3                   0                      0                    0   
4                   0                      0                    0   

subtype_SINGLE FAMILY / DUPLEX  
0                               0  
1                               0  
2                               0  
3                               1  
4                               1 

When I predict on the training inputs, I get predictions of only one class:

In[675]: np.bincount(rf.predict(Mytrain[Myfeatures]))
Out[675]: array([    0, 75091])
user2205916

1 Answer


There are several issues with your code, but the most glaring one is this:

data.target_names = np.asarray(list(set(data['type'])))
preds = data.target_names[rf.predict(Mytest[Myfeatures])]

Python's `set` is inherently unordered, so there is no guarantee that `np.asarray(list(set(data['type'])))` lists the class names in the same order that `pd.factorize` assigned its integer codes. Indexing that array with the predictions can therefore attach the wrong labels.
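To see this concretely: `pd.factorize` returns both the integer codes and the unique labels in encoding order, and the safe inverse mapping indexes those returned uniques rather than a re-derived set. A minimal sketch with made-up labels:

```python
import pandas as pd

s = pd.Series(["Type2", "Type1", "Type2", "Type2"])
codes, uniques = pd.factorize(s)

print(list(codes))    # [0, 1, 0, 0] -- codes in order of first appearance
print(list(uniques))  # ['Type2', 'Type1']

# The correct inverse mapping uses the uniques returned by factorize:
recovered = uniques[codes]
print(list(recovered))  # ['Type2', 'Type1', 'Type2', 'Type2']

# By contrast, list(set(s)) may come back in either order,
# so indexing it with `codes` can silently swap the classes.
```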

Here's a cleaned-up version of your code:

# build your data
data = train_df[cols_to_keep]
data = data.join(dummySubTypes.iloc[:, 1:])
data = data.join(dummyLicenseTypes.iloc[:, 1:])

# split into training/testing sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, train_size=0.75)

# fit the classifier; scikit-learn factorizes string labels internally
from sklearn.ensemble import RandomForestClassifier
features = data.columns[1:5]
target = 'type'
rf = RandomForestClassifier(n_jobs=2)
rf.fit(train[features], train[target])

# predict and compute confusion matrix
preds = rf.predict(test[features])
print(pd.crosstab(test[target], preds,
                  rownames=['actual'],
                  colnames=['preds']))

If the results are still not what you expect, I'd suggest doing some hyperparameter optimization on your random forest using scikit-learn's `GridSearchCV` (in `sklearn.model_selection`).
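A minimal sketch of such a search, using generated toy data in place of `train[features]`/`train[target]` (the parameter ranges here are arbitrary placeholders, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# toy data standing in for train[features] / train[target]
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# small illustrative grid over two forest hyperparameters
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

`grid.best_estimator_` can then be used to predict on the test set just like the `rf` above.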

jakevdp
  • Thank you very much for your advice. Definitely something I can't find in a book. I'll take a look at your suggestions. – user2205916 Nov 22 '15 at 16:04
  • Re: your comment about sets being unordered, I was trying to get unique values from the list per the popular answer here: http://stackoverflow.com/questions/12897374/get-unique-values-from-a-list-in-python Would you say that is an incorrect answer? – user2205916 Nov 22 '15 at 16:41
  • That's certainly a valid way to get unique values (though np.unique would be faster), but the values are returned in an arbitrary order. That becomes a problem when you index into them in the next line. – jakevdp Nov 22 '15 at 16:50
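To illustrate the ordering point from the last comment, a short sketch comparing the two approaches:

```python
import numpy as np

labels = ["Type2", "Type1", "Type2"]

# np.unique returns the unique values in sorted order -- deterministic,
# though not necessarily the order pd.factorize assigned its codes.
print(np.unique(labels))  # ['Type1' 'Type2']

# set() yields the same values, but its iteration order is arbitrary,
# so np.asarray(list(set(labels))) is unsafe to index with integer codes.
print(sorted(set(labels)) == list(np.unique(labels)))  # True
```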