I've started working with quantile random forests (QRFs) from the scikit-garden
package. Previously I was creating regular random forests using RandomForestRegressor
from sklearn.ensemble.
It appears that the speed of the QRF is comparable to the regular RF with small dataset sizes, but that as the size of the data increases, the QRF becomes MUCH slower at making predictions than the RF.
Is this expected? If so, could someone please explain why these predictions take so long, and/or suggest how I could get quantile predictions in a more timely manner?
See below for a toy example, where I test the training and predictive times for a variety of dataset sizes.
import matplotlib as mpl
mpl.use('Agg')
from sklearn.ensemble import RandomForestRegressor
from skgarden import RandomForestQuantileRegressor
from sklearn.model_selection import train_test_split
import numpy as np
import time
import matplotlib.pyplot as plt
log_ns = np.arange(0.5, 5, 0.5) # number of observations (log10)
ns = (10 ** (log_ns)).astype(int)
print(ns)
m = 14 # number of covariates
train_rf = []
train_qrf = []
pred_rf = []
pred_qrf = []
for n in ns:
    # create dataset
    print('n = {}'.format(n))
    print('m = {}'.format(m))
    rndms = np.random.normal(size=n)
    X = np.random.uniform(size=[n, m])
    betas = np.random.uniform(size=m)
    y = 3 + np.sum(betas[None, :] * X, axis=1) + rndms
    # split test/train
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # random forest
    rf = RandomForestRegressor(n_estimators=1000, random_state=0)
    st = time.time()
    rf.fit(X_train, y_train)
    en = time.time()
    print('Fit time RF = {} secs'.format(en - st))
    train_rf.append(en - st)
    # quantile random forest
    qrf = RandomForestQuantileRegressor(random_state=0, min_samples_split=10, n_estimators=1000)
    qrf.set_params(max_features=X.shape[1] // 3)
    st = time.time()
    qrf.fit(X_train, y_train)
    en = time.time()
    print('Fit time QRF = {} secs'.format(en - st))
    train_qrf.append(en - st)
    # predictions
    st = time.time()
    preds_rf = rf.predict(X_test)
    en = time.time()
    print('Prediction time RF = {}'.format(en - st))
    pred_rf.append(en - st)
    st = time.time()
    preds_qrf = qrf.predict(X_test, quantile=50)
    en = time.time()
    print('Prediction time QRF = {}'.format(en - st))
    pred_qrf.append(en - st)
fig, ax = plt.subplots()
ax.plot(np.log10(ns), train_rf, label='RF train', color='blue')
ax.plot(np.log10(ns), train_qrf, label='QRF train', color='red')
ax.plot(np.log10(ns), pred_rf, label='RF predict', color='blue', linestyle=':')
ax.plot(np.log10(ns), pred_qrf, label='QRF predict', color='red', linestyle =':')
ax.legend()
ax.set_xlabel('log10(n)')
ax.set_ylabel('time (s)')
fig.savefig('time_comparison.png')
Here is the output:
[Figure: Time comparison of RF and QRF training and predictions]
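For reference, one rough fallback for getting quantile-like predictions quickly is to take percentiles across the individual trees' predictions of a regular RandomForestRegressor. This is only a sketch and is not the same estimator as a QRF: the spread across trees reflects disagreement between per-tree mean estimates, so the resulting bands will generally be narrower than true conditional quantiles. The helper name `per_tree_percentiles`, the synthetic data, and the hyperparameters are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def per_tree_percentiles(forest, X, percentiles=(5, 50, 95)):
    """Percentiles across the individual trees' predictions for each row of X.

    Hypothetical helper: an approximation, not a quantile regression forest.
    """
    # Shape (n_trees, n_samples): one row of predictions per fitted tree.
    tree_preds = np.stack([tree.predict(X) for tree in forest.estimators_])
    # Transpose to shape (n_samples, n_percentiles).
    return np.percentile(tree_preds, percentiles, axis=0).T


# Synthetic data in the same spirit as the toy example (assumed sizes).
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 14))
y = 3 + X @ rng.uniform(size=14) + rng.normal(size=500)

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
rf.fit(X[:400], y[:400])

bands = per_tree_percentiles(rf, X[400:])
print(bands.shape)  # (100, 3): one row per test point, one column per percentile
```

Because each tree's `predict` call is vectorised over all test rows, the cost grows with the number of trees and test points rather than with a per-test-point pass over the training set.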