Accessing individual leaves in randomForest

Question

I'm using the package quantregForest in R, which is based on randomForest, to generate forecast intervals from a set of predictors.

After training the algorithm on some data, it outputs a quantile-based prediction interval for each set of predictors in the test data. As I understand it, each leaf (terminal node) in the generated random forest represents a distribution of values. How can I access the values that make up each leaf in the forest?

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Mar 17 '22 at 18:39

Reid Johnson · Answer 1 · 2022-03-28T15:37:57.817

I understand that you're currently using the R package quantregForest. I'm not well-versed in that package, so I'll answer with the quantile-forest package, a comparable Python implementation of Quantile Regression Forests. You may be able to produce your desired outcome in Python; if not, the concepts discussed here may translate to quantregForest. A quantile regression forest must store the training response values (or a mapping to them) in its leaf nodes, so it should be conceptually possible to retrieve those values in any canonical implementation. I'll speculate on how this might be accomplished with quantregForest at the end of my answer.

Extracting Leaf Values from quantile-forest Implementation

As of v1 of the quantile-forest package, the training response (y) values are stored in a list, model.forest_.y_train, and a mapping of training sample indices to leaf nodes is stored in model.forest_.y_train_leaves, a 3-dimensional array of shape (n_estimators, max_n_leaves, max_n_leaf_samples). The mapping uses 1-indexed values (rather than Python's usual 0-indexing) so that it can be stored as a sparse array, with 0 marking unused elements rather than the first training sample. To retrieve the values that make up a leaf, then, take that leaf's row in the mapping, subtract 1 from each entry, keep the non-negative results, and use them as indices into the stored response values.
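
To make the 1-indexed, zero-padded layout concrete, here is a minimal sketch on toy data. The arrays below are invented stand-ins that only mimic the shapes described above; they are not produced by the package, and leaf_values is a hypothetical helper name.

```python
import numpy as np

# Toy stand-in for the stored training responses (values are made up).
y_train = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Toy mapping of shape (n_estimators=1, max_n_leaves=3, max_n_leaf_samples=3).
# Entries are 1-indexed training sample indices; 0 marks an unused slot.
y_train_leaves = np.array([[[0, 0, 0],    # node 0: empty (not a leaf)
                            [1, 3, 0],    # node 1: samples 0 and 2
                            [2, 4, 5]]])  # node 2: samples 1, 3 and 4


def leaf_values(tree, node):
    """Hypothetical helper: response values stored in one node of one tree."""
    idx = y_train_leaves[tree, node] - 1  # shift back to 0-indexing
    idx = idx[idx >= 0]                   # padding became -1; drop it
    return y_train[idx]


print(leaf_values(0, 1))  # [10. 30.]
print(leaf_values(0, 2))  # [20. 40. 50.]
```

Because the padding value 0 becomes -1 after the shift, the non-negative filter is what separates real sample indices from unused slots.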

Code Examples Using quantile-forest Implementation

Here's an example that puts these details together in order to access the values in a particular leaf:

import numpy as np
from quantile_forest import RandomForestQuantileRegressor
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

qrf = RandomForestQuantileRegressor(random_state=0)
qrf.fit(X_train, y_train)

# Get the training indices for tree=0, leaf=18683.
y_train_leaves = np.asarray(qrf.forest_.y_train_leaves)
train_indices = y_train_leaves[0, 18683, :] - 1
train_indices = train_indices[train_indices >= 0]

# Get the training response values for the training indices
print(np.array(qrf.forest_.y_train)[train_indices])

The above example shows how to access an individual leaf node. You could loop over each node for each tree in order to access the leaf values across the full ensemble. So continuing from the above example:

n_trees, n_nodes, _ = y_train_leaves.shape
for tree_i in range(n_trees):
    for node_j in range(n_nodes):
        train_indices_ij = y_train_leaves[tree_i, node_j] - 1
        train_indices_ij = train_indices_ij[train_indices_ij >= 0]
        print(np.array(qrf.forest_.y_train)[train_indices_ij])

Note that the above looping will include non-terminal nodes; leaf nodes will be those nodes with non-empty lists.
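
If you only want the terminal nodes, one option is to skip the empty rows and collect the rest into a dictionary. The following is a sketch, assuming the mapping has the layout described earlier in this answer; collect_leaf_values is a hypothetical helper, not part of the package, and the toy arrays exist only to show the result.

```python
import numpy as np


def collect_leaf_values(y_train, y_train_leaves):
    """Return {(tree, node): response values} for every non-empty node."""
    y_train = np.asarray(y_train)
    y_train_leaves = np.asarray(y_train_leaves)
    n_trees, n_nodes, _ = y_train_leaves.shape
    leaves = {}
    for tree_i in range(n_trees):
        for node_j in range(n_nodes):
            idx = y_train_leaves[tree_i, node_j] - 1
            idx = idx[idx >= 0]
            if idx.size:  # empty rows are non-terminal or unused nodes
                leaves[(tree_i, node_j)] = y_train[idx]
    return leaves


# Toy data: 2 trees, up to 3 nodes each, up to 2 samples per leaf.
toy_y = [1.5, 2.5, 3.5]
toy_map = [[[0, 0], [1, 2], [3, 0]],
           [[0, 0], [2, 0], [1, 3]]]
toy_leaves = collect_leaf_values(toy_y, toy_map)
print(sorted(toy_leaves))  # [(0, 1), (0, 2), (1, 1), (1, 2)]
```

With a fitted model, the call would presumably be collect_leaf_values(qrf.forest_.y_train, qrf.forest_.y_train_leaves), given the attributes described above.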

That said, depending on your desired goal here, there may be further convenience functions that can help. For example, if you want to find which samples share leaf nodes (known as proximities), the package has a proximity_counts function that can do this. Here's an example of using that function to get the values of every training sample that shares a leaf with the first test sample:

proximities = qrf.proximity_counts(X_test)
prox_indices = np.array([x[0] for x in proximities[0]])
print(np.array(qrf.forest_.y_train)[prox_indices])

This function could be used, for example, to get the response values that are used to calculate the quantile(s) for particular samples or to count the number of times that pairs of samples reside in the same leaf node.
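
To illustrate what counting shared leaves means, here is a small pure-Python sketch of the idea. It is not how the package implements proximity_counts; the leaf assignments are invented, and toy_proximity_counts is a hypothetical name.

```python
from collections import Counter

# Hypothetical leaf assignments: train_leaf_ids[t][i] is the leaf that
# training sample i falls into in tree t (3 trees, 4 training samples).
train_leaf_ids = [
    [1, 1, 2, 2],
    [3, 4, 3, 4],
    [5, 5, 5, 6],
]
# Leaf reached by a single test sample in each of the 3 trees.
test_leaf_ids = [1, 3, 5]


def toy_proximity_counts(train_leaf_ids, test_leaf_ids):
    """Count, per training sample, the trees where it shares the test leaf."""
    counts = Counter()
    for tree_leaves, test_leaf in zip(train_leaf_ids, test_leaf_ids):
        for sample_i, leaf in enumerate(tree_leaves):
            if leaf == test_leaf:
                counts[sample_i] += 1
    # (sample index, count) pairs, most frequently shared first
    return counts.most_common()


pairs = toy_proximity_counts(train_leaf_ids, test_leaf_ids)
print(pairs)  # [(0, 3), (1, 2), (2, 2)]
```

Sample 0 shares a leaf with the test sample in all three trees, so its response value would carry the most weight when estimating quantiles for that test sample; sample 3 never shares a leaf and does not appear.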

Applying the Above Concepts to quantregForest Implementation

I'm not intimately familiar with the quantregForest package, but a brief look at the code suggests similarities to the above. The counterpart to the y_train_leaves object appears to be valuesNodes. Note, however, that it appears to store the response values directly (rather than a mapping to a separate list of values) and appears to store only 1 value per leaf node. With those caveats in mind, you should be able to use this object to retrieve the values stored for each leaf node.
