
I have been trying to build a RandomForestClassifier() (RF) model and a DecisionTreeClassifier() (DT) model that produce the same output (for learning purposes only). I found several answers listing the parameters required to make the two models equivalent, but no code that actually does it, so I'm trying to build that code myself:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

random_seed = 42

X, y = make_classification(
    n_samples=100000,
    n_features=5,
    random_state=random_seed
)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=random_seed)

DT = DecisionTreeClassifier(criterion='gini',             # default
                            splitter='best',              # default
                            max_depth=None,               # default
                            min_samples_split=2,          # default
                            min_samples_leaf=1,           # default
                            min_weight_fraction_leaf=0.0, # default
                            max_features=None,            # default
                            random_state=random_seed,     # NON-default
                            max_leaf_nodes=None,          # default
                            min_impurity_decrease=0.0,    # default
                            class_weight=None,            # default
                            ccp_alpha=0.0                 # default
                           )
DT.fit(X_train, y_train)

RF = RandomForestClassifier(n_estimators=1,               # NON-default
                            criterion='gini',             # default
                            max_depth=None,               # default
                            min_samples_split=2,          # default
                            min_samples_leaf=1,           # default
                            min_weight_fraction_leaf=0.0, # default
                            max_features=None,            # NON-default
                            max_leaf_nodes=None,          # default 
                            min_impurity_decrease=0.0,    # default
                            bootstrap=False,              # NON-default
                            oob_score=False,              # default 
                            n_jobs=None,                  # default
                            random_state=random_seed,     # NON-default
                            verbose=0,                    # default
                            warm_start=False,             # default
                            class_weight=None,            # default
                            ccp_alpha=0.0,                # default
                            max_samples=None              # default
                           )

RF.fit(X_train, y_train)

RF_pred =  RF.predict(X_test)
RF_proba = RF.predict_proba(X_test)
DT_pred =  DT.predict(X_test)
DT_proba = DT.predict_proba(X_test)


# Here we validate that the outputs are actually equal, and report the fraction of test rows that are NOT equal
print('If DT_pred = RF_pred:', np.array_equal(DT_pred, RF_pred), '; Fraction of rows not equal:', (DT_pred != RF_pred).mean())
print('If DT_proba = RF_proba:', np.array_equal(DT_proba, RF_proba), '; Fraction of rows not equal:', (DT_proba != RF_proba).any(axis=1).mean())

# A plot that shows where those differences are concentrated
sns.set(style="darkgrid")
mask = (RF_proba[:,1] - DT_proba[:,1]) != 0
only_differences = (RF_proba[:,1] - DT_proba[:,1])[mask]
sns.kdeplot(only_differences, fill=True, color="r")  # `fill` replaces the deprecated `shade` (seaborn >= 0.11)
plt.title('Plot of only differences in probs scores')
plt.show()

Output:

[Image: KDE plot titled "Plot of only differences in probs scores"]

I even found an answer comparing XGBoost with a decision tree that says they are almost identical, but when I test their probability outputs they are fairly different.

So, am I doing something wrong here? How can I get the same probabilities for those two models? Is there a possibility to get True for those two print() statements in the code above?

desertnaut
Chris

1 Answer


It appears to be due to the random states, despite your best efforts. For the random forest's randomization to be effective, it needs to give each component decision tree a different random state (it does so with sklearn.ensemble._base._set_random_states; see its source). You can check in your code that while RF.random_state and DT.random_state are both 42, RF.estimators_[0].random_state is 1608637542.
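The check described above can be reproduced with a short script. This is a minimal sketch, assuming a smaller dataset than the one in the question (2,000 samples, to keep it fast) and leaving all other parameters at the values the question uses; the name `inner_seed` is only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset than in the question (an assumption, only to keep this quick).
X, y = make_classification(n_samples=2000, n_features=5, random_state=42)

DT = DecisionTreeClassifier(random_state=42).fit(X, y)
RF = RandomForestClassifier(
    n_estimators=1, max_features=None, bootstrap=False, random_state=42
).fit(X, y)

# The forest keeps the seed it was given, but its single tree receives a
# seed drawn from a generator initialised with that value.
inner_seed = RF.estimators_[0].random_state
print("DT.random_state:               ", DT.random_state)
print("RF.random_state:               ", RF.random_state)
print("RF.estimators_[0].random_state:", inner_seed)
```

The last line should show a value other than 42, which is the mismatch the answer points to.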

When bootstrap=False and max_features=None, I believe the different seed only affects how ties between equally good splits are broken, so the results are very close on the training set. That can translate to slightly larger differences on a test set.
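One way to test this explanation is to go the other way round: instead of forcing 42 onto the forest's tree, give a standalone tree the seed that the forest handed to its single tree. The sketch below assumes a smaller dataset than the question's and uses illustrative names (`DT_same`, `DT_42`); if the seed is the only source of the mismatch, the first comparison should come out equal.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller dataset than in the question (an assumption, only to keep this quick).
X, y = make_classification(n_samples=2000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

RF = RandomForestClassifier(
    n_estimators=1, max_features=None, bootstrap=False, random_state=42
).fit(X_train, y_train)

# Seed that the forest actually passed to its only tree.
inner_seed = RF.estimators_[0].random_state

DT_same = DecisionTreeClassifier(random_state=inner_seed).fit(X_train, y_train)
DT_42 = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

RF_proba = RF.predict_proba(X_test)
same_seed_equal = np.array_equal(RF_proba, DT_same.predict_proba(X_test))
frac_diff_42 = (RF_proba != DT_42.predict_proba(X_test)).any(axis=1).mean()

print("RF equals DT fitted with the forest's inner seed:", same_seed_equal)
print("Fraction of test rows where RF differs from DT(42):", frac_diff_42)
```

On a small dataset there may be few tied splits, so the tree seeded with 42 can also happen to agree with the forest; the differences tend to show up more on larger data such as the 100,000 samples in the question.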

Ben Reiniger
  • Excellent! Is there a way to modify that random state of the unique tree of RF to 42? Like `RF.estimators_[0].random_state = 42`? In addition, could you share the code of your initial observations? It could help me to visualize and learn how the probabilities are finally computed (the actual math) from the data.. which is my final objective. – Chris Apr 12 '22 at 16:28
  • You can just set the random state, but if you refit the forest it will reset the random state; you could just refit the tree itself, but maybe that's less satisfying? As for the initial observations, I realized I was being misled: the nodes needn't match up by index. A better way to do this is to use the `decision_path` method together with the underlying `tree` attributes `feature` and `threshold`. – Ben Reiniger Apr 22 '22 at 01:52
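The `decision_path` approach mentioned in the last comment could look roughly like this. It is a sketch, not the commenter's own code: the helper `path_splits` is hypothetical, and the dataset is a smaller stand-in for the one in the question. For one sample it lists the (feature, threshold) pair tested at each internal node on the way to the leaf, so the two trees can be compared by path rather than by node index.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def path_splits(tree, x):
    """Hypothetical helper: (feature, threshold) pairs tested for one sample."""
    # decision_path returns a sparse indicator matrix; for a single row,
    # .indices holds the ids of the nodes the sample passes through.
    node_ids = tree.decision_path(x.reshape(1, -1)).indices
    t = tree.tree_
    # Leaves have a negative feature index, so keep only the split nodes.
    return [(int(t.feature[n]), float(t.threshold[n]))
            for n in node_ids if t.feature[n] >= 0]


# Smaller dataset than in the question (an assumption, only to keep this quick).
X, y = make_classification(n_samples=2000, n_features=5, random_state=42)

DT = DecisionTreeClassifier(random_state=42).fit(X, y)
RF = RandomForestClassifier(
    n_estimators=1, max_features=None, bootstrap=False, random_state=42
).fit(X, y)

sample = X[0]
dt_path = path_splits(DT, sample)
rf_path = path_splits(RF.estimators_[0], sample)

print("path lengths (DT, RF):", len(dt_path), len(rf_path))
# zip stops at the shorter path, so compare the lengths above as well.
for depth, (dt_split, rf_split) in enumerate(zip(dt_path, rf_path)):
    flag = "same" if dt_split == rf_split else "DIFFERENT"
    print(depth, dt_split, rf_split, flag)
```

The first depth flagged as different is where the two trees chose different splits for that sample.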