Unsupervised Random Forest Proximities in Python

Question

I am currently re-visiting a random forests project I performed a few years back using the R-language, to:

generate a proximity matrix of the data inputs using unsupervised RandomForest
calculate the distance matrix from this proximity matrix and pass to Partitioning Around Medoids (PAM) clustering algorithm
use the clusters obtained through PAM as labels and run RandomForest in supervised mode to train a new model
use this model to predict on another dataset from a future point in time.

I have shifted my workflow to Python for many of my projects, as the language is very flexible and fun, but I am still getting my bearings in sklearn compared to how I performed such tasks in R. My hang-up is in producing a proximity matrix (or some container holding the proximity between samples) to pass to PAM. I have found the following post, which describes a similar issue, but I have been unable to find a way to implement what the accepted answer's author suggests.

Any clues as to how to implement this? Any help would be greatly appreciated, and I will be sure to return the favor to the larger community. I know there are lots of other R-to-Python converts out there who would benefit from this sort of information.

Thanks in advance and apologies if this is a simple solution that I am simply overlooking.

Any progress on this? No one really described how to implement this in Python with sklearn. — O.rka, Nov 21 '18 at 14:04

Soroosh · Answer 1 · 2015-07-23T20:36:39.373

You can use the bigrf package, written in R ( https://cran.r-project.org/web/packages/bigrf/bigrf.pdf ). It has everything you need.

Here is how you can implement it in R:

# load bigrf library
library('bigrf')

# generate synthetic dataset
synthetic.df <- generateSyntheticClass(x)

# create rf model
forest <- bigrfc(synthetic.df$x, synthetic.df$y, trace = 1)

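# calculate proximities between all observations (original and synthetic)
dist <- proximities(forest, trace = 2)
dist <- data.frame(as.matrix(dist))
# keep only the rows and columns of the original observations in x
dist <- dist[1:nrow(x), 1:nrow(x)]
# convert proximities to distances
dist <- sqrt(1 - dist)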
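
Since the question asks about sklearn, here is a rough Python counterpart of the R code above. It is only a sketch under a few assumptions: the synthetic class is built by permuting each column independently (one common way to do it, not necessarily what bigrf does internally), proximity is taken as the fraction of trees in which two samples land in the same leaf, and the function names rf_proximity and proximity_to_distance are made up for this example. It relies on RandomForestClassifier.apply, which returns the leaf index of every sample in every tree.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rf_proximity(X, n_estimators=200, random_state=0):
    """Sketch of unsupervised random forest proximities (hypothetical helper)."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Assumption: synthetic class = each column shuffled independently,
    # which keeps the marginals but breaks the dependence between features.
    X_synth = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(X.shape[1])]
    )
    X_all = np.vstack([X, X_synth])
    y_all = np.r_[np.zeros(n, dtype=int), np.ones(n, dtype=int)]

    # Train a forest to tell original rows (0) from synthetic rows (1).
    forest = RandomForestClassifier(
        n_estimators=n_estimators, random_state=random_state
    )
    forest.fit(X_all, y_all)

    # Leaf index of every original sample in every tree: (n, n_estimators).
    leaves = forest.apply(X)

    # Proximity = fraction of trees in which two samples share a leaf.
    prox = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        col = leaves[:, t]
        prox += col[:, None] == col[None, :]
    return prox / leaves.shape[1]


def proximity_to_distance(prox):
    """Same transform as the R code above: sqrt(1 - proximity)."""
    return np.sqrt(np.clip(1.0 - prox, 0.0, None))
```

The resulting distance matrix is dense (n x n), so this only suits moderately sized datasets. It can then be handed to any PAM / k-medoids implementation that accepts a precomputed distance matrix.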

score 0 · Answer 2 · answered Jul 23 '15 at 18:22

First of all, you might want to check out pandas: http://pandas.pydata.org/. It may make your life much easier.

For a solution using plain Python data structures, it will really depend on how you're loading the data and what you're doing with it afterwards (e.g. what your PAM method needs).

One convenient way of storing distances is an adjacency list. There are many ways to implement this. I like to use a hash (a Python dict) where the keys are coordinate tuples and the values are distances.

a = {}
a[(0,1)] = 7
a[(1,5)] = 20
a[(6,1)] = 1

This is for 2 dimensions, but you can go higher by giving the keys more coordinates.
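
If the PAM implementation expects a full square matrix rather than a dict, the pairs can be expanded. A minimal sketch, assuming the distances are symmetric and that missing pairs should get a default value (to_dense is a hypothetical helper name, not part of any library):

```python
def to_dense(pairs, n, default=0.0):
    """Expand a {(i, j): distance} dict into a symmetric n x n list of lists."""
    dense = [[default] * n for _ in range(n)]
    for (i, j), d in pairs.items():
        # Assumption: distance is symmetric, so fill both (i, j) and (j, i).
        dense[i][j] = d
        dense[j][i] = d
    return dense


a = {(0, 1): 7, (1, 5): 20, (6, 1): 1}
matrix = to_dense(a, n=7)
```

Whether 0.0 is a sensible default for unknown pairs depends on the clustering method, so treat that parameter as a placeholder.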
