
I'm working on a regression problem, and have been using both the R randomForest package and the Python sklearn random forest regression estimator.

The R package can calculate the feature importance score in two different ways:

  1. The first measure is computed from permuting OOB data: for each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). The same is then done after permuting each predictor variable. The differences between the two errors are averaged over all trees and normalized by the standard deviation of the differences. (A sketch of this procedure appears after this list.)

  2. The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares (RSS).
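
For concreteness, here is a minimal Python sketch of the permutation procedure in #1. This is only an illustration: it assumes a fitted regressor `model` and a pandas validation split `X_valid`/`y_valid` rather than the per-tree OOB samples that randomForest actually uses, and it skips the normalization by the standard deviation:

import numpy as np
from sklearn.metrics import mean_squared_error

def permutation_importance_sketch(model, X_valid, y_valid, seed=0):
    # Increase in MSE after permuting each column; larger = more important.
    rng = np.random.default_rng(seed)
    baseline = mean_squared_error(y_valid, model.predict(X_valid))
    scores = {}
    for col in X_valid.columns:
        X_perm = X_valid.copy()
        # shuffling a column breaks its association with the target
        X_perm[col] = rng.permutation(X_perm[col].values)
        scores[col] = mean_squared_error(y_valid, model.predict(X_perm)) - baseline
    return scores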

sklearn, on the other hand, implements only the latter (see here for details).
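
For reference, here is a rough sketch of what that impurity-based computation (method #2) looks like, reading sklearn's fitted trees directly. The tree_ attributes used here do exist in sklearn, but this glosses over the per-tree normalization that the real feature_importances_ applies, so treat it as illustrative only:

import numpy as np

def impurity_importance_sketch(rf, n_features):
    # Accumulate the weighted impurity decrease credited to each feature
    # across all trees of a fitted RandomForestRegressor `rf`.
    total = np.zeros(n_features)
    for est in rf.estimators_:
        t = est.tree_
        w = t.weighted_n_node_samples
        for node in range(t.node_count):
            left, right = t.children_left[node], t.children_right[node]
            if left == -1:  # leaf node: no split, nothing to credit
                continue
            decrease = (w[node] * t.impurity[node]
                        - w[left] * t.impurity[left]
                        - w[right] * t.impurity[right])
            total[t.feature[node]] += decrease
    return total / total.sum()  # normalize so the importances sum to 1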

I've been interested in comparing method #2 in both implementations, so I've done the following:

R

library(randomForest)

iteration_count <- 3
seeds <- seq(1, iteration_count, 1)
tree_count <- 500
rfmodels <- vector("list", iteration_count)  # pre-allocate the list of fitted models

for (i in 1:iteration_count) {
  set.seed(seeds[[i]])
  # X contains the response column y, so the formula interface is used with data=X
  # (see the comments below)
  rfmodels[[i]] <- randomForest(y ~ ., data = X, ntree = tree_count,
                                importance = TRUE, na.action = na.omit)
}

# convert all iterations into matrix form; type=2 gives method #2
# (mean decrease in node impurity), which is what sklearn reports
imp_score_matrix <- do.call(cbind, lapply(rfmodels, function(x) { importance(x, type=2)[,1] }))

# Calculate mean and s.d. for importance ranking of each feature based on a
# matrix of feature importance scores (rowSds comes from the matrixStats package)
library(matrixStats)
imp_score_stats <- cbind(rowMeans(imp_score_matrix), rowSds(imp_score_matrix))

# Order the matrix so that the features are ranked by mean (most important features will be in the last rows)
ordered_imp_score_stats <- imp_score_stats[order(imp_score_stats[,1]),]

sklearn

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# get FIS through mean decrease in impurity (default method for sklearn)
num_iter = 3  # number of times to generate FIS; will average over these scores
trees = 500
seeds = list(range(num_iter))
FIS = []

# R implementation of RF settings - https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
num_features = 1/3.0  # fraction of features tried per split; see mtry
leaf = 5              # minimum samples per leaf; see nodesize

FIS_map = dict(enumerate(X.columns))  # {i: feature}; maps column position to OTU name
for i in range(num_iter):
    print("Iteration", i)
    clf = RandomForestRegressor(n_jobs=-1, n_estimators=trees, random_state=seeds[i],
                                max_features=num_features, min_samples_leaf=leaf)
    clf = clf.fit(X, y)
    FIS.append(clf.feature_importances_)

FIS_stats = pd.DataFrame(FIS).describe().T  # rows are feature positions; columns include mean, std, etc.
FIS_stats = FIS_stats.sort_values("mean", ascending=False)  # most important features on top
FIS_stats['OTU'] = pd.Series(FIS_map)  # add the OTU ID, aligned on feature position
FIS_stats = FIS_stats.set_index('OTU')
FIS_stats = FIS_stats[FIS_stats['mean'] > 0]  # remove OTU features with no mean importance

As you can see, I've tried to adjust the default settings in sklearn to match those used in R. The problem is that I get different results from the two implementations. Now, I understand that random forests involve several sources of randomness, so I don't expect the features to be ranked exactly the same; however, I'm seeing almost no overlap between the important features.

Furthermore, when I select the top-ranked features, those chosen by R perform much better than those chosen by sklearn on a hold-out sample set.
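
One way to quantify "almost no overlap" is below (a sketch only; `r_ranking` and `sk_ranking` are hypothetical stand-ins for the two ordered feature lists produced above):

from scipy.stats import spearmanr

def compare_rankings(r_ranking, sk_ranking, k=20):
    # Both arguments are lists of feature names, most important first.
    top_r, top_sk = set(r_ranking[:k]), set(sk_ranking[:k])
    jaccard = len(top_r & top_sk) / len(top_r | top_sk)  # top-k set overlap
    shared = sorted(set(r_ranking) & set(sk_ranking))
    pos_r = {f: i for i, f in enumerate(r_ranking)}
    pos_sk = {f: i for i, f in enumerate(sk_ranking)}
    rho, _ = spearmanr([pos_r[f] for f in shared],
                       [pos_sk[f] for f in shared])  # full-ranking agreement
    return jaccard, rho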

Am I doing something wrong? What could explain the resulting difference?

Update

Per the comment regarding the feature importances being calculated with the Gini index in sklearn, the source code for random forest regression shows that MSE is used to calculate impurity.
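
A small sketch on synthetic data makes this visible: each fitted tree exposes per-node impurity via its tree_ attributes, and for the regression criterion that value is a mean squared error (the variance of y within the node) rather than a raw RSS. (The dataset and names below are illustrative only.)

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, random_state=0)
rf_demo = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)
tree = rf_demo.estimators_[0].tree_
print(tree.impurity[:5])        # per-node impurity: MSE of y within the node
print(tree.n_node_samples[:5])  # multiplying by these recovers per-node RSS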

So, it looks like R uses RSS and sklearn uses MSE, the relationship being:

$$\text{RSS} = n \times \text{MSE}$$

Could this account for the difference?

  • Have you tried to increase the number of iterations for both scripts? – Edgar Derby Aug 18 '15 at 22:31
  • Does not make a difference :( – Constantino Aug 18 '15 at 23:52
  • The documentation of the R random forest says: _For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares._ As you are doing regression, maybe this is the difference? In the scikit-learn doc only the Gini criterion is mentioned to measure impurity. – Alexander Bauer Aug 19 '15 at 01:19
  • I find your R code `randomForest(X ~ .,y,ntree=tree_count, ...)` strange. If your `X` data frame does not include response, you need `randomForest(X, y, ntree=tree_count, ...)` without using the formula. If `X` does include response, you need formula interface only `randomForest(y ~ ., data=X, ntree=tree_count, ...)` With your code as it is, I am not sure what you are fitting. Could you check this? – lanenok Aug 21 '15 at 19:57
  • indeed you're right; I had made a mistake when cleaning up the code for posting - I've updated it to reflect the correct way I'm actually calculating FIS – Constantino Aug 21 '15 at 21:36

0 Answers