I'm trying to perform clustering in Python using Random Forests. In the R implementation of Random Forests, there is a flag you can set to get the proximity matrix. I can't seem to find anything similar in the python scikit version of Random Forest. Does anyone know if there is an equivalent calculation for the python version?
3 Answers
20
We don't implement the proximity matrix in Scikit-Learn (yet). However, this can be done by relying on the `apply` function provided in our implementation of decision trees. That is, for all pairs of samples in your dataset, iterate over the decision trees in the forest (through `forest.estimators_`) and count the number of times they fall in the same leaf, i.e., the number of times `apply` gives the same node id for both samples in the pair.
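For illustration, a minimal sketch of that idea, assuming a scikit-learn release where `apply` is exposed on the individual trees (recent versions expose it on both the trees and the forest):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

n_samples = X.shape[0]
prox = np.zeros((n_samples, n_samples))
for tree in forest.estimators_:
    leaves = tree.apply(X)                  # leaf id of each sample in this tree
    prox += np.equal.outer(leaves, leaves)  # +1 for every pair that shares a leaf
prox /= len(forest.estimators_)             # fraction of trees in which a pair shares a leaf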
Hope this helps.

Gilles Louppe
- How do I access the apply function? If I try `i_node = tree.apply(full_data[i])`, I get "AttributeError: 'DecisionTreeClassifier' object has no attribute 'apply'" – WtLgi Sep 10 '13 at 14:24
- It looks like this functionality is higher up, in sklearn.ensemble.RandomForestClassifier, so I don't need to iterate over all the trees? Is this correct? http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.apply Just apply one entry at a time? – WtLgi Sep 10 '13 at 14:30
- Indeed, sorry, `apply` is directly available in the forest, hence you don't need to iterate over the trees yourself. – Gilles Louppe Sep 11 '13 at 08:14
- @GillesLouppe thanks! I had a follow-up question about the best way to visualize this proximity matrix, which I posted on Cross Validated: https://stats.stackexchange.com/questions/409263/how-to-visualize-proximity-score-in-random-forests – Yu Chen May 20 '19 at 17:12
- Ah sorry, never mind, I realized you explain how it was created a bit later in your dissertation. – Yu Chen May 20 '19 at 19:35
17
Based on Gilles Louppe's answer, I have written a function. I don't know how efficient it is, but it works. Best regards.
import numpy as np

def proximityMatrix(model, X, normalize=True):
    # Leaf indices for every sample in every tree: shape (n_samples, n_trees)
    terminals = model.apply(X)
    nTrees = terminals.shape[1]

    # For each pair of samples, count in how many trees they land in the same leaf
    a = terminals[:, 0]
    proxMat = 1 * np.equal.outer(a, a)
    for i in range(1, nTrees):
        a = terminals[:, i]
        proxMat += 1 * np.equal.outer(a, a)

    if normalize:
        proxMat = proxMat / nTrees
    return proxMat
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

# Fit a forest on the breast cancer data and compute its proximity matrix
train = load_breast_cancer()
model = RandomForestClassifier(n_estimators=500, max_features=2, min_samples_leaf=40)
model.fit(train.data, train.target)

proximityMatrix(model, train.data, normalize=True)
## array([[ 1. , 0.414, 0.77 , ..., 0.146, 0.79 , 0.002],
## [ 0.414, 1. , 0.362, ..., 0.334, 0.296, 0.008],
## [ 0.77 , 0.362, 1. , ..., 0.218, 0.856, 0. ],
## ...,
## [ 0.146, 0.334, 0.218, ..., 1. , 0.21 , 0.028],
## [ 0.79 , 0.296, 0.856, ..., 0.21 , 1. , 0. ],
## [ 0.002, 0.008, 0. , ..., 0.028, 0. , 1. ]])
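To come back to the clustering goal in the question, one option (a sketch, assuming the proximityMatrix function and the fitted model above) is to turn the proximities into dissimilarities and hand them to SciPy's hierarchical clustering:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

prox = proximityMatrix(model, train.data, normalize=True)
dist = 1.0 - prox                                # proximity -> dissimilarity
condensed = squareform(dist, checks=False)       # condensed distance vector
Z = linkage(condensed, method='average')         # average-linkage hierarchical clustering
labels = fcluster(Z, t=2, criterion='maxclust')  # e.g. cut the tree into 2 clusters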

Vyga