
I have two RandomForestClassifier models, and I would like to combine them into one meta model. They were both trained using similar, but different, data. How can I do this?

rf1 #this is my first fitted RandomForestClassifier object, with 250 trees
rf2 #this is my second fitted RandomForestClassifier object, also with 250 trees

I want to create big_rf with all trees combined into one 500 tree model

mgoldwasser

2 Answers


I believe this is possible by modifying the estimators_ and n_estimators attributes on the RandomForestClassifier object. Each tree in the forest is stored as a DecisionTreeClassifier object, and the list of these trees is stored in the estimators_ attribute. To keep the object consistent, you should also update n_estimators to match the new length of estimators_.

The advantage of this method is that you could build a bunch of small forests in parallel across multiple machines and combine them.

Here's an example using the iris data set:

from functools import reduce  # reduce is no longer a builtin in Python 3

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split  # cross_validation was removed in newer scikit-learn
from sklearn.datasets import load_iris

def generate_rf(X_train, y_train, X_test, y_test):
    rf = RandomForestClassifier(n_estimators=5, min_samples_leaf=3)
    rf.fit(X_train, y_train)
    print("rf score", rf.score(X_test, y_test))
    return rf

def combine_rfs(rf_a, rf_b):
    rf_a.estimators_ += rf_b.estimators_
    rf_a.n_estimators = len(rf_a.estimators_)
    return rf_a

iris = load_iris()
X, y = iris.data[:, [0, 1, 2]], iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.33)
# in the line below, we create 10 random forest classifier models
rfs = [generate_rf(X_train, y_train, X_test, y_test) for i in range(10)]
# in this step below, we combine the list of random forest models into one giant model
rf_combined = reduce(combine_rfs, rfs)
# the combined model scores better than *most* of the component models
print("rf combined score", rf_combined.score(X_test, y_test))
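To build the component forests in parallel, as the answer suggests, one option is joblib (already a scikit-learn dependency). This is a sketch, not part of the original answer; the choice of n_jobs=2 and 10 forests of 5 trees each is arbitrary:

```python
from functools import reduce

from joblib import Parallel, delayed
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def generate_rf(X_train, y_train):
    rf = RandomForestClassifier(n_estimators=5, min_samples_leaf=3)
    rf.fit(X_train, y_train)
    return rf

def combine_rfs(rf_a, rf_b):
    # merge rf_b's trees into rf_a and keep n_estimators consistent
    rf_a.estimators_ += rf_b.estimators_
    rf_a.n_estimators = len(rf_a.estimators_)
    return rf_a

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# fit 10 small forests in parallel worker processes, then merge them
rfs = Parallel(n_jobs=2)(delayed(generate_rf)(X_train, y_train) for _ in range(10))
rf_combined = reduce(combine_rfs, rfs)
print(rf_combined.n_estimators)  # 50 trees total
```

The same pattern extends to multiple machines: pickle each fitted forest, ship the pickles to one place, and reduce over them there.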
mgoldwasser
  • Is there a way to generalize this to use other models -- logistic regression, Gaussian NB, SVM? – Merlin May 20 '16 at 18:27
  • @mgoldwasser hi, I just read your answer and I have a more general question. Can I use features that don't have the same length? Can, for example, one have 300 samples and the other 200? Sorry for the off-topic, but reading your answer, I am thinking of building a forest for each feature. – DimKoim Jun 23 '16 at 22:08
  • rf_a.n_estimators = len(rf_a.estimators_) .. Err.. shouldn't this be; rf_a.n_estimators += len(rf_a.n_estimators) ???? – Software Mechanic Oct 18 '16 at 12:10
  • 2
    @SoftwareMechanic code is correct. `rf_a.estimators` is updated in previous line, and its length is what we want for `n_estimators` – Paul Feb 07 '17 at 20:03

In addition to @mgoldwasser's solution, an alternative is to make use of warm_start when training your forest. In Scikit-Learn 0.16-dev, you can now do the following:

# First build 100 trees on X1, y1
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X1, y1)

# Build 100 additional trees on X2, y2
clf.set_params(n_estimators=200)
clf.fit(X2, y2)
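A self-contained version of the above, with synthetic data standing in for X1/y1 and X2/y2 (the make_classification split below is an assumption for illustration, not from the original answer):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-ins for the two datasets; note both halves must share the
# same feature count and label set for warm_start to work (see comments below)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X1, y1, X2, y2 = X[:200], y[:200], X[200:], y[200:]

clf = RandomForestClassifier(n_estimators=100, warm_start=True, random_state=0)
clf.fit(X1, y1)                 # first 100 trees trained on (X1, y1)

clf.set_params(n_estimators=200)
clf.fit(X2, y2)                 # keeps the old trees, adds 100 more on (X2, y2)

print(len(clf.estimators_))  # 200
```

With warm_start=True, the second fit does not retrain the existing trees; it only grows the forest up to the new n_estimators.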
Gilles Louppe
    warm_start does not seem to work when the two datasets have different numbers of labels. For example, if you have (x1, y1) where y1 can take on 3 labels, and then (x2,y2) where y2 can take on an additional label, training with warm_start fails. Swapping the order around still results in an error. – user929404 Mar 29 '15 at 16:01
  • 4
  • @user929404 to point out the obvious, the model is being trained on nameless columns in a numpy array. When you initially train the model, it looks at `X1` and `y1` to determine how many features and classes it will be trained on, and when you go on to train on `X2` and `y2` these have to match, because the model can't magically understand how the variables of the first matrix line up with those of the second matrix, unless it assumes that they are the same. – v4gil Oct 05 '18 at 15:43
  • Does this method affect the order of the datasets used? If there were 3 datasets, would it make any difference if they were getting trained in different order every time? – Andreas Alamanos Nov 29 '20 at 19:03