I'm working with a small data set of 5 variables and ~90k observations. I tried fitting a random forest classifier by following the iris example from http://blog.yhathq.com/posts/random-forests-in-python.html. The problem is that every prediction comes out as the same class. I'm new to Python but familiar with R, so I'm not sure whether this is a coding mistake or a sign that my features carry no useful signal.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
data = train_df[cols_to_keep]
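# Join the dummy columns, dropping the first level of each (positional slice; .ix is removed in current pandas)
data = data.join(dummySubTypes.iloc[:, 1:])
data = data.join(dummyLicenseTypes.iloc[:, 1:])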
data['is_train'] = np.random.uniform(0, 1, len(data)) <= .75
#data['type'] = pd.Categorical.from_codes(data['type'],["Type1","Type2"])
data.head()
Mytrain, Mytest = data[data['is_train']==True], data[data['is_train']==False]
Myfeatures = data.columns[1:5] # string of feature names: subtype dummy variables
rf = RandomForestClassifier(n_jobs=2)
y, _ = pd.factorize(Mytrain['type'])
rf.fit(Mytrain[Myfeatures], y)
data.target_names = np.asarray(list(set(data['type'])))
preds = data.target_names[rf.predict(Mytest[Myfeatures])]
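A side note on the label mapping above: `pd.factorize` assigns codes in order of first appearance, while `set(...)` has no guaranteed order, so the names built from the set may not line up with the codes. A minimal sketch of keeping the two in sync, assuming a made-up `types` series standing in for `Mytrain['type']` (the real column is not shown here):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Mytrain['type']; values are invented for illustration.
types = pd.Series(["Type2", "Type1", "Type1", "Type2", "Type1"])

# factorize returns integer codes plus the unique labels,
# ordered by first appearance in the series.
codes, uniques = pd.factorize(types)

# Indexing the uniques with the codes recovers the original labels,
# so the same array can decode whatever integers the model predicts.
decoded = np.asarray(uniques)[codes]
print(decoded)
```

Using the second return value of `factorize` as the name lookup avoids depending on set ordering. This would not explain single-class predictions by itself, but it could mislabel which class is being predicted.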
Predictions of one class, Type1:
In[583]: pd.crosstab(Mytest['type'], preds, rownames=['actual'], colnames=['preds'])
Out[582]:
preds   Type1
actual
Type1   17818
Type2    7247
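For reference, the counts in this crosstab give the accuracy a model gets by always predicting the majority class. A small sketch of that arithmetic, using only the two numbers shown above (the variable names are mine):

```python
# Test-set counts taken from the crosstab above.
n_type1 = 17818
n_type2 = 7247

# Accuracy of a model that always predicts Type1.
baseline_accuracy = n_type1 / (n_type1 + n_type2)
print(round(baseline_accuracy, 3))
```

So roughly 71% of the test rows are Type1, which means a forest that finds no useful signal in the features can still score about that well by predicting Type1 everywhere.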
Update: First few rows of data:
In[670]: Mytrain[Myfeatures].head()
Out[669]:
   subtype_INDUSTRIAL  subtype_INSTITUTIONAL  subtype_MULTIFAMILY  \
0                   0                      0                    0
1                   0                      0                    0
2                   0                      0                    0
3                   0                      0                    0
4                   0                      0                    0
   subtype_SINGLE FAMILY / DUPLEX
0                               0
1                               0
2                               0
3                               1
4                               1
When I predict on the training inputs, I get predictions of only one class:
In[675]: np.bincount(rf.predict(Mytrain[Myfeatures]))
Out[674]: array([ 0, 75091])
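One check that might separate "coding mistake" from "uninformative features": with only a handful of 0/1 dummy columns there are very few distinct input rows, and if one class is the majority within every distinct row, predicting that class everywhere is the most accurate thing a model can do on that data. A sketch on invented toy data (the column names `subtype_A`, `subtype_B` and all values are hypothetical, not my real data):

```python
import pandas as pd

# Hypothetical toy data; names and values are made up for illustration.
toy = pd.DataFrame({
    "subtype_A": [0, 0, 0, 1, 1, 0, 1, 0],
    "subtype_B": [0, 1, 0, 0, 0, 1, 0, 0],
    "type": ["Type1", "Type1", "Type2", "Type1",
             "Type1", "Type1", "Type2", "Type1"],
})
features = ["subtype_A", "subtype_B"]

# Count each class within every distinct combination of feature values.
counts = toy.groupby(features + ["type"]).size().unstack(fill_value=0)

# Majority class per combination: what an accuracy-maximizing model
# would predict for rows with those feature values.
majority = counts.idxmax(axis=1)
print(counts)
print(majority)
```

If the same table built from `Mytrain[Myfeatures]` and `Mytrain['type']` shows one class winning in every row, the single-class output would be expected behavior rather than a bug, and the fix would be more informative features or some form of class weighting.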