I'm working on a regression problem and have been using both the R randomForest package and the Python sklearn random forest regression estimator.
The R package can calculate the feature importance score in two different ways:
The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences.
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares (RSS).
sklearn, by contrast, only implements the latter (see here for details).
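For context, here is a minimal sketch of what the two measures look like on the sklearn side. The data is a synthetic placeholder, and sklearn's permutation_importance permutes a held-out set rather than each tree's OOB sample, so it is only an approximation of measure #1, not the same computation R does:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# placeholder data purely for illustration
X_demo, y_demo = make_regression(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)

# measure #2: mean decrease in impurity, which sklearn exposes directly
mdi_importance = rf.feature_importances_

# rough analogue of measure #1: permutation importance on held-out data
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
perm_importance = perm.importances_mean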
I've been interested in comparing method #2 between the two implementations, so I've done the following:
R
library(randomForest)
library(matrixStats)  # for rowSds

iteration_count <- 3
seeds <- seq(1, iteration_count, 1)
tree_count <- 500
rfmodels <- vector("list", iteration_count)
for (i in 1:iteration_count) {
  set.seed(seeds[[i]])
  rfmodels[[i]] <- randomForest(y ~ ., X, ntree = tree_count, importance = TRUE, na.action = na.omit)
}
# convert the importance scores from all iterations into matrix form (features x iterations)
imp_score_matrix <- do.call(cbind, lapply(rfmodels, function(x) { importance(x, scale = TRUE, type = 1)[, 1] }))
# calculate the mean and s.d. of each feature's importance score across iterations
imp_score_stats <- cbind(rowMeans(imp_score_matrix), rowSds(imp_score_matrix))
# order so that the features are ranked by mean (most important features will be in the last rows)
ordered_imp_score_stats <- imp_score_stats[order(imp_score_stats[, 1]), ]
sklearn
# get FIS through mean decrease in impurity (default method for sklearn)
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

num_iter = 3  # number of times to generate FIS; will average over these scores
trees = 500
seeds = list(range(num_iter))
FIS = []
# settings adjusted to match the R randomForest defaults - https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
num_features = 1/3.0  # see mtry (fraction of features considered at each split)
leaf = 5              # see nodesize
FIS_map = {i: f for i, f in enumerate(X.columns.values)}  # {column index: feature name}
for i in range(num_iter):
    print("Iteration", i)
    clf = RandomForestRegressor(n_jobs=-1, n_estimators=trees, random_state=seeds[i],
                                max_features=num_features, min_samples_leaf=leaf)
    clf = clf.fit(X, y)
    FIS.append(clf.feature_importances_)
FIS_stats = pd.DataFrame(FIS).describe().T  # one row per feature; columns include mean, std, etc.
FIS_stats = FIS_stats.sort_values("mean", ascending=False)  # most important features on top
FIS_stats['OTU'] = pd.Series(FIS_map)  # add the OTU (feature) name, aligned on column index
FIS_stats = FIS_stats.set_index('OTU')
FIS_stats = FIS_stats[FIS_stats['mean'] > 0]  # drop OTU features with zero mean importance
As you can see, I've tried to adjust the default settings in sklearn to match those used in R. The problem is that I get different results from each implementation. I understand that there are several sources of randomness in random forests, so I don't expect the features to be ranked exactly the same; however, I'm seeing almost no overlap between the important features.
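For what it's worth, this is roughly how the overlap between the two rankings could be quantified. Here r_importance and sk_importance are assumed to be pandas Series of scores indexed by feature name (exported from imp_score_matrix and FIS_stats above); they are placeholders, not variables defined in the code above:

from scipy.stats import spearmanr

def ranking_agreement(r_importance, sk_importance, top_k=20):
    # rank correlation over the features present in both rankings
    common = r_importance.index.intersection(sk_importance.index)
    rho, _ = spearmanr(r_importance[common], sk_importance[common])
    # fraction of the top-k features shared by the two rankings
    top_r = set(r_importance.sort_values(ascending=False).head(top_k).index)
    top_sk = set(sk_importance.sort_values(ascending=False).head(top_k).index)
    overlap = len(top_r & top_sk) / float(top_k)
    return rho, overlap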
Furthermore, when I use the top X features, those chosen by R perform much better than those chosen by sklearn on a hold-out sample set.
Am I doing something wrong? What could explain the resulting difference?
Update
Per the comment regarding the feature importances being calculated with the Gini index in sklearn, the source code for random forest regression shows that MSE is used to calculate impurity.
So, it looks like R uses RSS and sklearn uses MSE, the relationship being:

MSE = RSS / n

(where n is the number of samples).
Could this account for the difference?