While using the RandomForestRegressor I noticed something strange. To illustrate the problem, here is a small example: I applied the RandomForestRegressor to a test dataset and plotted the graph of the first tree in the forest. This gives me the following output:
Root_node:
mse=8.64
samples=2
value=20.4
Left_leaf:
mse=0
samples=1
value=24
Right_leaf:
mse=0
samples=1
value=18
First, I expected the root node to have a value of (24+18)/2 = 21, but somehow it is 20.4.
However, even if this value is correct, how do I get an mse of 8.64?
From my point of view it should be: 1/2 * [(24-20.4)^2 + (18-20.4)^2] = 9.36
(under the assumption that the root value of 20.4 is correct).
With the plain mean instead, I get: 1/2 * [(24-21)^2 + (18-21)^2] = 9, which is also what I get if I just use the DecisionTreeRegressor.
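For reference, both candidate mse values above can be reproduced with a few lines of plain Python (the first uses the reported root value of 20.4, the second the plain mean of 21):

```python
vals = [24, 18]  # the two leaf values

# candidate 1: squared deviations around the reported root value 20.4
mse_reported = sum((v - 20.4) ** 2 for v in vals) / len(vals)
print(round(mse_reported, 2))  # 9.36

# candidate 2: squared deviations around the plain mean (24 + 18) / 2 = 21
mean = sum(vals) / len(vals)
mse_mean = sum((v - mean) ** 2 for v in vals) / len(vals)
print(round(mse_mean, 2))  # 9.0
```

Neither of these matches the 8.64 shown in the plot.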
Is there something wrong with the implementation of the RandomForestRegressor, or am I completely wrong?
Here is my reproducible code:
import pandas as pd
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
import graphviz
# create example dataset
data = {'AGE': [91, 42, 29, 94, 85], 'TAX': [384, 223, 280, 666, 384], 'Y': [19, 21, 24, 13, 18]}
df = pd.DataFrame(data=data)
x = df[['AGE','TAX']]
y = df['Y']  # 1-D target avoids a DataConversionWarning in fit
rf_reg = RandomForestRegressor(max_depth=2, random_state=1)
rf_reg.fit(x,y)
# plot a single tree of forest
dot_data = tree.export_graphviz(rf_reg.estimators_[0], out_file=None, feature_names=x.columns)
graph = graphviz.Source(dot_data)
graph
and the output graph:
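Independently of the plot, the same numbers can be read straight off the fitted tree. This sketch uses scikit-learn's public `tree_` attributes (`impurity`, `value`, `n_node_samples`, `weighted_n_node_samples`) on the example data above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# same example dataset as above
data = {'AGE': [91, 42, 29, 94, 85], 'TAX': [384, 223, 280, 666, 384], 'Y': [19, 21, 24, 13, 18]}
df = pd.DataFrame(data=data)
x = df[['AGE', 'TAX']]
y = df['Y']

rf_reg = RandomForestRegressor(max_depth=2, random_state=1)
rf_reg.fit(x, y)

t = rf_reg.estimators_[0].tree_   # low-level tree of the first estimator
# all arrays are indexed by node id; node 0 is the root
print(t.n_node_samples)           # samples reaching each node
print(t.weighted_n_node_samples)  # sample weights summed per node
print(t.impurity)                 # the mse shown in the plot
print(t.value.ravel())            # the value (node mean) shown in the plot
```

Comparing `t.impurity[0]` and `t.value.ravel()[0]` with the plotted root numbers shows whether the plot and the fitted tree agree; the difference between `n_node_samples` and `weighted_n_node_samples` may also be informative here.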